2 Executive Summary


The  ability  to  autonomously  coordinate  a  team  of  agents  to  actively  collect  information  is 
critical to a wide array of Air Force missions. With computing becoming both cheaper
and more powerful, there is a trend to push critical decision-making capabilities “downstream”,
towards the data collection nodes, rather than wait for data to arrive at a massive centralized
location before a decision is made. This new computing paradigm relies on networked agents
to actively collect, process and query data and promises to significantly improve both the
quality/relevance of the collected data and the associated decision making. The technological
bottlenecks  for  such  a  computing  scheme  stem  from  a  lack  of  mathematics  and  algorithms  to 
manage  such  systems  rather  than  difficulties  associated  with  building  and  deploying  them. 


2.1  Objectives 

This  project  provides  a  comprehensive  solution  to  the  problem  of  intelligent  data  gathering  and 
decision  making  by  ensuring  that  the  information  collected  by  an  agent  has  the  most  “added 
value”  to  the  full  network.  The  three  specific  objectives  of  this  project  are  to: 

1.  Derive  the  system  properties  that  quantify  the  alignment  between  local  and  network 
utilities; 

2.  Derive  (and  update)  agent  utilities  under  communication  and  computation  restrictions 
that  will  lead  to  good  network  utilities;  and 

3. Derive agent utilities and learning strategies for agents in dynamic and stochastic
environments with “black box” network utility functions.

2.2  Key  Contributions 

The  key  contribution  of  this  project  is  to  shift  the  focus  from  “how  to  optimize”  to  “what  to 
optimize”  in  difficult  coordination  problems.  The  impact  of  this  work  extends  to  a  large  class 
of  problems  relevant  to  the  Air  Force  including  satellite  communication  systems,  reconfigurable 
flight  control  systems,  sensor  networks,  and  intelligence  gathering  in  hybrid  networks. 

We  obtained  significant  results  supporting  all  three  objectives.  In  particular  we  have: 


• Applied system characteristics to derive agent utility functions for coordinating information
gathering robots. Robots using such utility functions to learn actions significantly
outperform other robots in a simulated information gathering task requiring coordination.
This result directly supports objective 1.

• Developed a new and fast learning algorithm supporting multiagent coordination. These
results are based on learning from “actions not taken”, meaning that the agents update
their estimate of potential actions’ outcomes based on information gathered by other
agents. This result directly supports objective 2 by improving the network utility.

• Developed agent utilities that promote team formation in multiagent systems and allow
agents to achieve good values of network utility without requiring communication. This
result directly supports objective 2. Papers:

• Decomposed network utilities into components to allow approximations to agent utilities
that promote coordination in the presence of “black box” utility functions. This result
directly supports objective 3. Papers:

• Achieved coordinated behavior in a team of heterogeneous agents aiming to achieve a
high level utility. This result directly supports all three objectives and the overall
goals of the proposal.

3 Publication List


Based on the work performed on this project, we published 2 journal articles and 10 conference
papers, 6 of which were in highly refereed conferences with acceptance rates below 50%. In
addition, one journal paper was recently submitted for review.

(45% acceptance). Nominated for best application paper award.

July 2010 (45% acceptance)

3. M. Salichon and K. Turner. A Neuro-Evolutionary Approach to Micro Aerial Vehicle
Portland, OR, July 2010 (45% acceptance). Nominated for best application paper
award.

Conference. May 2010 (43% acceptance)

5.  M.  Knudson  and  K.  Turner.  Policy  Search  and  Policy  Gradient  Methods  for  Autonomous 
Navigation.  Proceedings  of  the  2010  Genetic  and  Evolutionary  Computation  Conference. 
Portland, OR, July 2010 (45% acceptance)

2009:


6. N. Khani and K. Turner. Learning from Actions Not Taken: A Multiagent Learning Algorithm
(extended abstract). In Proceedings of the Eighth International Joint Conference
on Autonomous Agents and MultiAgent Systems. Budapest, Hungary, May 2009 (41%
acceptance).

7. K. Turner and N. Khani. Learning from Actions Not Taken in Multiagent Systems.
Advances in Complex Systems, Vol 12:455-473, 2009.

8. K. Turner and A. Agogino. Multiagent Learning for Black Box System Reward Functions.

4 Technical Contributions


The key technical contribution of this work was to determine how to provide utilities for
individual components of a team to ensure the coordinated and efficient behavior of the full
team. To that end, three key results were that (i) agents can learn from each other without
explicitly trying each alternative action; (ii) agents can derive local utilities from “black box”
network utility functions; and (iii) agents that learn together can achieve tight coordination
if their utilities are properly derived based on the characteristics derived in objectives 1 and
2. The three attached articles provide the key results from these scientific contributions
of this work.


4.1  Learning  from  Actions  not  Taken 

K.  Turner  and  N.  Khani.  Learning  from  Actions  Not  Taken  in  Multiagent  Systems.  Advances 
in  Complex  Systems ,  Vol  12:455-473,  2009. 


4.2  Learning  from  Blackbox  Utility  Functions 

K.  Turner  and  A.  Agogino.  Multiagent  Learning  for  Black  Box  System  Reward  Functions. 
Advances  in  Complex  Systems ,  Vol  12:475-492,  2009. 


4.3  Coordinating  Heterogeneous  Teams  of  Robots 

M.  Knudson  and  K.  Turner.  Coevolution  of  Heterogeneous  Multi-Robot  Teams.  Proceedings 
of  the  2010  Genetic  and  Evolutionary  Computation  Conference.  Portland,  OR,  July  2010. 
In large cooperative multiagent systems, coordinating the actions of the agents is critical
to the overall system achieving its intended goal. Even when the agents aim to
cooperate,  ensuring  that  the  agent  actions  lead  to  good  system  level  behavior  becomes 
increasingly  difficult  as  systems  become  larger.  One  of  the  fundamental  difficulties  in 
such  multiagent  systems  is  the  slow  learning  process  where  an  agent  not  only  needs  to 
learn  how  to  behave  in  a  complex  environment,  but  also  needs  to  account  for  the  actions 
of  other  learning  agents.  In  this  paper,  we  present  a  multiagent  learning  approach  that 
significantly  improves  the  learning  speed  in  multiagent  systems  by  allowing  an  agent 
to  update  its  estimate  of  the  rewards  (e.g.  value  function  in  reinforcement  learning)  for 
all  its  available  actions,  not  just  the  action  that  was  taken.  This  approach  is  based  on 
an  agent  estimating  the  counterfactual  reward  it  would  have  received  had  it  taken  a 
particular  action.  Our  results  show  that  the  rewards  on  such  “actions  not  taken”  are 
beneficial  early  in  training,  particularly  when  only  particular  “key”  actions  are  used.  We 
then  present  results  where  agent  teams  are  leveraged  to  estimate  those  rewards.  Finally, 
we  show  that  the  improved  learning  speed  is  critical  in  dynamic  environments  where 
fast  learning  is  critical  to  tracking  the  underlying  processes. 

Keywords:  Multiagent  learning;  counterfactual  reward;  difference  reward. 


1.  Introduction 

Learning in large multiagent systems is a critical area of research with applications
ranging from RoboCup soccer [26, 27], to rover coordination [19], to trading
agents [25, 43], to air traffic management [32]. What makes this problem particularly
challenging is that the agents in the system provide a constantly changing
background  in  which  each  agent  needs  to  learn  its  task.  As  a  consequence,  almost 
by  definition,  all  multiagent  learning  occurs  in  complex  environments,  where  the 
agents  need  to  extract  the  underlying  reward  signal  from  the  noise  of  the  other 
agents  acting  within  the  same  environment. 

Furthermore, two learning problems are typically coupled: the agent needs
to solve both a temporal credit assignment problem (how to assign a reward received
at the end of a sequence of actions to each action) and a structural credit assignment
problem  (how  to  assign  credit  to  a  particular  agent  at  the  end  of  a  multiagent 
task)  [1,  15,  16,  28,  38,  41,  44].  The  temporal  credit  assignment  problem  has  been 
extensively  studied  [10,  16,  28,  31,  30,  39,  42],  and  the  structural  credit  assignment 
problem  has  recently  been  investigated  as  well  [4,  8,  11,  20,  23,  35]. 

Learning  sequences  of  actions  for  multiagent  systems  has  blended  these  two 
areas  of  research  and  led  to  key  advances  [6,  8,  12,  26,  40].  In  these  cases,  the 
learning  needs  of  the  agents  are  modified  to  account  for  their  presence  in  a  larger 
system [2, 11, 13, 22, 35, 37]. However, though these methods have yielded tremendous
advances in multiagent learning, they are principally based on an agent trying
an  action,  receiving  an  evaluation  of  that  action,  and  updating  its  own  estimate  on 
the  “value”  of  taking  that  action  in  that  state.  Though  effective,  such  an  approach 
is  generally  slow  to  converge,  particularly  in  large  and  dynamic  environments. 

In  this  paper,  we  explore  the  concept  of  agents  learning  from  actions  they  do 
not  take  by  estimating  the  rewards  they  would  have  received  had  they  taken  those 
actions.  These  counterfactual  rewards  are  estimated  using  the  theory  developed  for 
structural credit assignment, and prove effective in congestion games. Furthermore,
a team structure can be used to provide the required information for the
agents to compute these reward estimates [24, 29]. A key benefit of this approach
is that an increase in the number of agents can be leveraged to improve the estimates
of actions not taken, turning a potential pitfall (e.g. how to extract useful
information from the actions of so many agents) into an asset (e.g. learn from the
experiences of other agents). Though the concept of updating rewards for actions
not taken is present in the learning automata literature, where, for example, the
probability of taking a particular action may go up or down based on similar actions’
results [21, 38, 39], in this work we explicitly aim to quantify the counterfactual
concept of “what would my reward have been, had I taken another action?”

In Sec. 2, we discuss the congestion problem that we use in the reported experiments.
In Sec. 3, we summarize the basic agent learning architecture. In Sec. 4, we
provide  the  action-not-taken  (ANT)  rewards  and  modify  them  using  team  rewards. 
We  also  provide  experimental  results  showing  the  basic  behavior  of  the  ANT  reward. 
In  Sec.  5,  we  explore  the  application  of  these  rewards  to  dynamic  domains  where  the 
rapidly  changing  conditions  put  a  premium  on  learning  quickly.  Finally,  in  Sec.  6, 
we  discuss  the  results  and  provide  directions  for  future  research. 


2.  Congestion  Problems 

Congestion  problems  where  system  performance  depends  on  the  number  of  agents 
taking  a  particular  action  provide  an  interesting  domain  to  study  the  behavior 
of  cooperative  multiagent  systems.  In  congestion  problems,  agents  need  to  learn 
how  to  synchronize  (or  not  synchronize)  their  actions,  rather  than  learn  to  take 
particular actions. This type of problem is ubiquitous in routing domains (e.g. on
a highway, a particular lane is not preferable to any other lane, but what matters
is  how  many  others  are  using  a  particular  lane)  [18,  34]. 

The multi-night bar problem is an abstraction of congestion games (and a
variant of the El Farol bar problem [5]), which have been extensively studied
[1, 5, 9, 7, 14]. In this version of the congestion problem, each agent has to determine
which day in the week to attend a bar. The problem is set up so that if either too
few agents attend (boring evening) or too many agents attend (crowded evening),
the total enjoyment of the attending agents drops.

The system performance is quantified by a system reward function G. This
reward is a function of the full system state z (e.g. the joint action of all agents in
the system), and is given by:

$$G(z) = \sum_{day=1}^{n} x_{day}\, e^{-x_{day}/C}, \tag{1}$$

where n is the number of actions (for example n = 7 if actions are days); $x_{day}$ is the
total attendance on a particular day; and C is a real-valued parameter that represents
the capacity of the resource (e.g. the capacity of the bar).
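As a small illustration of Eq. (1), the sketch below evaluates the system reward for two attendance profiles. It assumes the per-day term is $x_{day} e^{-x_{day}/C}$, as used in the difference reward of Eq. (5); the function name `system_reward` and the 30-agent example profiles are hypothetical and chosen only for illustration.

```python
import math


def system_reward(attendance, capacity):
    """Eq. (1): sum over days of x_day * exp(-x_day / C).

    `attendance` holds the number of agents attending each day.
    """
    return sum(x * math.exp(-x / capacity) for x in attendance)


# Hypothetical profiles for 30 agents, 5 days, capacity C = 6.
balanced = system_reward([6, 6, 6, 6, 6], capacity=6)   # every day at capacity
crowded = system_reward([30, 0, 0, 0, 0], capacity=6)   # everyone on one day
```

Each per-day term is largest when the attendance equals C, so the balanced profile scores far higher than the crowded one.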

What  is  interesting  about  this  game  is  that  selfish  behavior  by  the  agents  tends 
to  lead  the  system  to  undesirable  states.  For  example,  if  all  agents  predict  an  empty 
bar,  they  will  all  attend  (poor  reward)  or  if  they  all  predict  a  crowded  bar,  none 
will  attend  (poor  reward).  This  aspect  of  the  bar  problem  is  what  makes  this  a 
“congestion  game”  and  an  abstract  model  of  many  real-world  problems  ranging 
from  lane  selection  in  traffic  to  job  scheduling  across  servers  to  data  routing. 

3.  Basic  Agent  Learning 

Each agent’s action in this problem is to select a resource (the day on which to attend
the bar). The learning algorithm for each agent is a simple reinforcement learner
(action value). Each agent keeps an n-dimensional vector providing its estimates of
the reward it would receive for taking each possible action. The system dynamics
are given by:

Initialize: week ← 0

Repeat until week > Max week

1. agents choose actions;

2. agents’ joint action leads to an overall system state;

3. the system state results in a system reward;

4. each agent receives a reward;

5. each agent updates its action selection procedure (i.e. learning);

6. week ← week + 1.

In any week, an agent estimates its expected reward for attending a specific night
based on action values it has developed in previous weeks. At the beginning of each
training run, each agent has an equal probability of choosing each action in the first
week, resulting in a uniformly random distribution across actions. At the beginning
of each training week, each agent picks a night to attend by sampling from a
probability vector given by a Gibbs distribution. Each agent has n actions and a value
$V_k$ associated with each action $a_k$:

$$p_k = \frac{e^{V_k \cdot \tau}}{\sum_{j=1}^{n} e^{V_j \cdot \tau}}, \tag{2}$$

where $\tau$ is a temperature term that determines the amount of exploration (low
values of $\tau$ mean most actions have similar probabilities of being selected, whereas
high values of $\tau$ increase the probability that the best action will be selected).
Each agent receives reward R and updates the action value vector using a value
function $V_k$:

$$V_k = (1 - \alpha) \cdot V_k + \alpha \cdot R. \tag{3}$$
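The action selection of Eq. (2) and the update of Eq. (3) can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes that $\tau$ multiplies the action values (consistent with the statement that high $\tau$ favors the best action), and the values `tau=10.0` and `alpha=0.5` are hypothetical. Subtracting the maximum value before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math
import random


def gibbs_probabilities(values, tau):
    """Eq. (2): p_k proportional to exp(V_k * tau)."""
    top = max(values)  # shift for numerical stability; probabilities unchanged
    weights = [math.exp((v - top) * tau) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def choose_action(values, tau, rng):
    """Sample one action index from the Gibbs distribution."""
    probs = gibbs_probabilities(values, tau)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


def update_value(values, action, reward, alpha):
    """Eq. (3): V_k = (1 - alpha) * V_k + alpha * R, in place."""
    values[action] = (1 - alpha) * values[action] + alpha * reward


values = [0.0] * 5                                   # uniform start: all p_k equal
probs_initial = gibbs_probabilities(values, tau=10.0)
update_value(values, action=2, reward=1.0, alpha=0.5)
probs_after = gibbs_probabilities(values, tau=10.0)
picked = choose_action(values, tau=10.0, rng=random.Random(0))
```

With all values equal, every action is equally likely, which matches the uniform first week described above; after one rewarded update, the probability mass shifts towards the rewarded action.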

A  reasonable  option  is  to  provide  each  agent  with  the  full  system  reward  for 
each  week.  This  leads  to  each  agent  receiving  the  reward  given  in  Eq.  (1),  and 
using  that  reward  to  update  its  value  estimates  for  each  action.  However,  this 
reward  is  not  particularly  sensitive  to  an  agent’s  actions  and  especially  in  large 
systems,  leads  to  particularly  slow  learning.  As  a  consequence,  in  this  work,  we  use 
the  difference  reward  as  a  starting  point  for  the  reward  an  agent  receives  after  each 
step.  Earlier  work  has  shown  that  the  difference  reward  significantly  outperforms 
both  agents  receiving  a  purely  local  reward  and  all  agents  receiving  the  same  system 
reward  [3,  2,  33,  32,  36].  The  difference  reward  is  given  by: 

$$D_i(z) = G(z) - G(z - z_i), \tag{4}$$

where $z - z_i$ specifies the state of the system without agent i. (In this paper, we
use zero padded vector addition and subtraction to specify the state dependence on
specific components of the system.) In this instance z is the full attendance profile
of the agents, and $z - z_i$ is the attendance profile of all the agents without agent i.
Difference rewards are aligned with the system reward, in that any action that
improves the difference reward will also improve the system reward. This is because
the second term on the right-hand side of Eq. (4) does not depend on agent i’s
actions, meaning any impact agent i has on the difference reward is through the
first term (G) [32, 35]. Furthermore, it is more sensitive to the actions of agent i,
reflected in the second term of D, which removes the effects of other agents (i.e.
noise) from agent i’s reward function.

Intuitively, this causes the second term of the difference reward function to
evaluate the performance of the system without i, and therefore D measures the
agent’s contribution to the system reward directly. For the difference reward in the
congestion problem, this amounts to having each agent estimate the system reward
it would receive were it to take or not take a particular action. In this work, agents
do not explicitly communicate with one another, and therefore, the only effect each
agent has on the system is to increase the attendance, $x_{day}$, for night k by 1. This
leads to the following difference reward:

$$D_i(z) = G(z) - G(z - z_i) = x_{day_i}\, e^{-x_{day_i}/C} - (x_{day_i} - 1)\, e^{-(x_{day_i} - 1)/C}, \tag{5}$$

where $x_{day_i}$ is the total attendance on the day selected by agent i.
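As a quick numerical check, the sketch below compares the closed form of Eq. (5) with the direct evaluation $G(z) - G(z - z_i)$ from Eq. (4). The attendance profile and function names are hypothetical; the per-day reward term is assumed to be $x_{day} e^{-x_{day}/C}$ as in Eq. (1).

```python
import math


def system_reward(attendance, capacity):
    """Eq. (1): sum over days of x_day * exp(-x_day / C)."""
    return sum(x * math.exp(-x / capacity) for x in attendance)


def difference_reward(attendance, day, capacity):
    """Eq. (5): closed form for an agent that attended `day`."""
    x = attendance[day]
    return x * math.exp(-x / capacity) - (x - 1) * math.exp(-(x - 1) / capacity)


# Hypothetical attendance profile for 120 agents, 5 days, C = 6.
attendance = [40, 30, 25, 15, 10]
without_agent = list(attendance)
without_agent[3] -= 1          # z - z_i: remove one agent from day 3

direct = system_reward(attendance, 6) - system_reward(without_agent, 6)
closed_form = difference_reward(attendance, 3, 6)
```

All days other than the one chosen by agent i cancel in the subtraction, which is why the closed form only needs the attendance on that day.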

4.  Action-Not-Taken  (ANT)  Rewards 

Though the difference reward given in Eq. (5) provides a reward tuned to an agent’s
actions, it is still based on an agent sampling each of its actions a (potentially large)
number of times. In this work, in order to increase the learning speed, we introduce
the concept of ANT rewards [17]. The goal with ANT rewards is to provide estimates
of how the system would have turned out had an agent taken a particular action.
The mathematics that allow the computation of the difference reward can be used
to compute this type of reward.

In this paper, rather than have a separate results section, we provide experimental
results directly alongside the reward descriptions to motivate the improvements
to the rewards and the derivation of new rewards. All results are based on 20
independent runs with the standard error plotted when large enough to be relevant.
Unless  otherwise  specified  (as  with  the  scaling  runs  or  congestion  dependent  runs) 
the  number  of  agents  in  the  system  was  set  to  120,  with  C  =  6  (capacity),  and 
n  =  5  (number  of  actions,  or  days). 


4.1.  Basic  action-not-taken  reward 

The direct application of this concept is to have agents update their reward estimate
based on the reward they would have received had they taken other actions.
Therefore, at each time step, agents perform a mathematical operation that simulates
their taking a different action and compute the counterfactual reward that
would have resulted from that action. For an agent i who selected action a at this
step, the counterfactual reward for action b is given by:

$$D_i^{a \to b}(z) = G(z - z_i^a + z_i^b) - G(z - z_i^a), \tag{6}$$

where $D_i^{a \to b}$ is the reward for agent i taking action b; $z_i^a$ is the state component
where agent i has taken action a; $z_i^b$ is the state component where agent i has taken
action b.

The second term of Eq. (6), $G(z - z_i^a)$, is the same as the second term of Eq. (4),
namely the reward for the state where agent i has not taken the particular action
that it took. The first term, though, is the key to the ANT reward. In this case,
we compute the reward that would have resulted had agent i taken action b rather
than action a.

Utilizing this structure, $D^i_{ANT}$ can then be formulated as shown in Eq. (7):

$$D^i_{ANT} = \begin{cases} G(z) - G(z - z_i^a), & \text{for } i \to a, \\ G(z - z_i^a + z_i^b) - G(z - z_i^a), & \text{for } i \to b \neq a, \end{cases} \tag{7}$$

where $i \to a$ means that agent i has taken action a. Note, the removal of the state in
which agent i has taken action a in the second term represents the system state
without agent i. Because agent i had taken action a, this removal results in a state
where agent i has taken neither action a nor action b (which it has never taken).
Hence the second term is the same for both conditions of Eq. (7).
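The two cases of Eq. (7) can be computed in one pass, since they share the second term. The following is a minimal sketch under the assumption that the state is fully described by the per-day attendance counts; the function names and the example profile are hypothetical.

```python
import math


def system_reward(attendance, capacity):
    """Eq. (1): sum over days of x_day * exp(-x_day / C)."""
    return sum(x * math.exp(-x / capacity) for x in attendance)


def ant_rewards(attendance, taken, capacity):
    """Eq. (7): reward for the action taken and for every action not taken.

    `attendance` is the observed profile z, which includes agent i on `taken`.
    """
    base = list(attendance)
    base[taken] -= 1                     # z - z_i^a: the system without agent i
    g_without = system_reward(base, capacity)
    rewards = []
    for b in range(len(attendance)):
        moved = list(base)
        moved[b] += 1                    # z - z_i^a + z_i^b (b == taken gives z)
        rewards.append(system_reward(moved, capacity) - g_without)
    return rewards


attendance = [40, 30, 25, 15, 10]        # hypothetical profile, agent i on day 0
rewards = ant_rewards(attendance, taken=0, capacity=6)
```

For the action actually taken, the entry reduces to the ordinary difference reward of Eq. (5); the remaining entries are the counterfactual estimates.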

Figure 1 shows the learning curves for D and $D_{ANT}$ along with results where
agents directly use the system reward G and a local reward L to learn. The local
reward L is based on the agents simply receiving the reward for the action they
took, which in this instance is the component of Eq. (1) corresponding to the day
they attended the bar ($L = x_{day_i}\, e^{-x_{day_i}/C}$). This is a “selfish” reward, in that the agent
is only concerned with the day on which it decided to attend. Though yielding poor
results in this case, this is the naive decomposition of G to its components [45].

In  all  the  experiments,  the  system  performance  is  measured  with  respect  to  G, 
regardless  of  how  the  agents  were  trained.  As  previously  noted,  these  results  confirm 
that  agents  using  D  significantly  outperform  agents  using  G  or  L  in  this  domain.  G 
learns  little,  and  L  learns  to  do  the  wrong  thing:  Because  the  agent  rewards  are  not 
aligned,  agents  aiming  to  maximize  their  own  reward  lead  to  poor  system  states. 
We  include  the  results  for  agents  using  G  and  L  here  for  completeness,  but  we  will 
omit them in subsequent figures.

Fig. 1. System performance versus training weeks. In comparison to D, $D_{ANT}$ based on actions
not taken learns faster but shows a lower and noisier performance.

The results here show that although $D_{ANT}$ learns faster than D, it struggles
to  reach  good  solutions.  This  shows  that  the  ANT  reward  has  a  difficult  time 
estimating  the  reward  for  most  actions  once  those  actions  have  been  sampled.  Even 
though  the  agents  take  advantage  of  such  rewards  and  learn  faster  in  the  first  weeks 
of  training,  there  is  a  time  after  which  these  additional  rewards  become  detrimental 
to  the  learning  process.  This  suggests  two  possible  solutions,  which  we  explore  in 
the  next  two  sections: 

(1)  Use  the  ANT  reward  early  in  the  process,  but  stop  and  switch  to  basic  D  after 
a  “stop  week.” 

(2)  Select  only  a  subset  of  the  actions  to  receive  the  ANT  reward. 


4.2.  ANT  reward  with  early  stopping 

First,  let  us  consider  the  early  stopping  concept  to  mitigate  the  noisy  feedback 
agents  receive  for  their  actions.  This  modification  is  based  on  the  observation  that 
the  ANT  rewards  are  better  than  random  rewards,  but  not  as  good  as  rewards 
that  have  been  updated  by  actually  taking  the  actions.  Figure  2  shows  the  impact 
of  having  agents  use  ANT  rewards  for  the  first  6  weeks  and  then  switch  back  to 
using D (the impact of when to stop is discussed in Fig. 3). Results show that this
approach significantly speeds up the learning process, though it does not result in
agents reaching higher performance.

Figure  3  shows  the  dependence  of  the  system  performance  on  the  length  of  time 
the  ANT  reward  is  used.  The  learning  speed  is  stable  for  small  values  of  the  stop 
week,  but  starts  to  drop  slowly  as  the  actions  not  taken  are  used  more  extensively. 
There is a steady rightward shift as the stop week moves from 6 to 100, at which
point, the system learns more slowly than D alone. Providing a mechanism for
selecting the stop week based on either a preset number of ANT rewards or given
performance criteria would provide automation, though in this work, we simply
choose the stop week by trial and error based on Fig. 3.

Fig. 2. System performance when actions not taken are stopped after week 6. $D_{ANT-ES}$ (ANT
with Early Stopping) learns faster and reaches the same system rewards as D.

Fig. 3. The impact of the stop week on system performance. The learning speed is directly related
to the length of time the action-not-taken reward is used.
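To make the early stopping schedule concrete, the sketch below puts the pieces of Secs. 3 and 4 into one training loop: all actions are updated with ANT rewards before the stop week, and only the action taken is updated (plain D) afterwards. It is an illustrative sketch rather than the experimental code: the learning rate `alpha`, the temperature `tau`, the number of weeks and the seed are hypothetical, and no claim is made that it reproduces the curves in Figs. 2 and 3.

```python
import math
import random


def system_reward(attendance, capacity):
    """Eq. (1): sum over days of x_day * exp(-x_day / C)."""
    return sum(x * math.exp(-x / capacity) for x in attendance)


def ant_rewards(attendance, taken, capacity):
    """Eq. (7): rewards for the action taken and all actions not taken."""
    base = list(attendance)
    base[taken] -= 1
    g_without = system_reward(base, capacity)
    rewards = []
    for b in range(len(base)):
        moved = list(base)
        moved[b] += 1
        rewards.append(system_reward(moved, capacity) - g_without)
    return rewards


def train(num_agents=120, n=5, capacity=6, weeks=200, stop_week=6,
          alpha=0.5, tau=10.0, seed=0):
    """Return the system reward G observed in each training week."""
    rng = random.Random(seed)
    values = [[0.0] * n for _ in range(num_agents)]
    history = []
    for week in range(weeks):
        actions = []
        for v in values:                          # Gibbs selection, Eq. (2)
            top = max(v)
            weights = [math.exp((x - top) * tau) for x in v]
            actions.append(rng.choices(range(n), weights=weights, k=1)[0])
        attendance = [actions.count(k) for k in range(n)]
        history.append(system_reward(attendance, capacity))
        for v, a in zip(values, actions):
            rewards = ant_rewards(attendance, a, capacity)
            # Before the stop week update every action; afterwards only the
            # action taken, which is the difference reward of Eq. (5).
            updated = range(n) if week < stop_week else [a]
            for k in updated:                     # value update, Eq. (3)
                v[k] = (1 - alpha) * v[k] + alpha * rewards[k]
    return history


history = train(weeks=50)
```

Setting `stop_week=0` gives plain D, and setting it to the number of weeks gives the full ANT reward of Sec. 4.1, so the same loop can be used to compare the three variants.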

4.3.  ANT  reward  with  teams 

The second option we consider is to limit the actions that are updated based on
counterfactual rewards to reliable actions sampled by a subset of agents. To that
end, we introduce the concept of a team, and denote agent i’s team members by
$T_i$. In this context, $T_i$ is a fixed, randomly selected subset of the agents. This
formulation gives:

$$D^i_{ANT-L} = \begin{cases} G(z) - G(z - z_i^a), & \text{for } i \to a, \\ G(z - z_i^a + z_i^b) - G(z - z_i^a), & \text{for } i \to b \in T_i, \\ 0, & \text{otherwise}, \end{cases} \tag{8}$$

where $i \to b \in T_i$ means agent i selects actions b that are sampled by agent i’s
teammates $T_i$. As previously, the removal of agent i in the second term represents
the system state without agent i having taken either action a (which it had taken)
or action b (which it had not taken), leading to the term being the same in both
cases.
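Eq. (8) differs from Eq. (7) only in which actions receive a counterfactual reward. A minimal sketch, assuming the teammates' chosen days are available as a list (`teammate_actions`, a hypothetical name) and returning 0 for actions that no teammate sampled, as in the third case of Eq. (8):

```python
import math


def system_reward(attendance, capacity):
    """Eq. (1): sum over days of x_day * exp(-x_day / C)."""
    return sum(x * math.exp(-x / capacity) for x in attendance)


def team_limited_ant_rewards(attendance, taken, teammate_actions, capacity):
    """Eq. (8): ANT rewards restricted to actions sampled by teammates."""
    base = list(attendance)
    base[taken] -= 1                         # z - z_i^a
    g_without = system_reward(base, capacity)
    sampled = set(teammate_actions)
    rewards = [0.0] * len(attendance)        # third case of Eq. (8)
    for b in range(len(attendance)):
        if b == taken or b in sampled:
            moved = list(base)
            moved[b] += 1                    # z - z_i^a + z_i^b
            rewards[b] = system_reward(moved, capacity) - g_without
    return rewards


attendance = [40, 30, 25, 15, 10]            # hypothetical profile
rewards = team_limited_ant_rewards(
    attendance, taken=0, teammate_actions=[1, 1, 4], capacity=6)
```

Only days 0, 1 and 4 receive a nonzero reward here, so fewer counterfactual evaluations of G are needed per agent than with the full ANT reward.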

Figure 4 shows the results when an agent has 12 randomly selected team members
(in this case there are 120 total agents, so the team sizes are 10% of the total
agents). Other than at the extremes (e.g. team size of 2 or 110), the experiments
were not particularly sensitive to this parameter. By limiting the number of actions
that are updated ($D_{ANT-L}$ in black/dark), the variability of the reward is reduced as
compared to the full $D_{ANT}$ (in green/light), but there is no discernible improvement
in the quality of the solution. However, from a computational and communication
perspective, this is an interesting result, which points to a significant reduction in
the need for counterfactual reward computation without loss of convergence speed.

Fig. 4. System performance when only a subset of actions are explored by an agent (120 total
agents, team size of 12). D performs well. $D_{ANT-L}$, based on a limited number of actions not taken,
performs similarly to $D_{ANT}$ but shows a response with a lower noise level.

We  now  combine  the  two  concepts  and  have  agents  use  teams  and  early  stopping. 
Furthermore,  instead  of  using  the  team  members  as  information  sources  only,  we 
increase  the  connection  among  team  members  by  providing  them  all  with  the  same 
reward.  That  is,  all  team  members  attending  a  particular  day  will  receive  the  same 
reward.  The  learning  strategy  is  to  use  team  information  only  during  the  first  weeks 
(three  in  the  reported  results,  but  the  performance  is  similar  for  minor  changes  to 
this  parameter)  of  learning  and  switch  to  the  regular  difference  reward  [Eq.  (5)]  for 
the  rest  of  the  training  period. 

The  key  aspect  of  this  approach  is  that  the  team  members  measure  the  impact 
of  a  team  not  taking  a  particular  action,  rather  than  an  individual  agent.  As  a 
result,  agents  learn  with  their  team  in  a  smaller  state  space  defined  by  the  world 
minus  their  team  space  instead  of  the  entire  world.  This  is  conceptually  similar  to 
the  reward  described  in  Eq.  (8)  but  where  the  impact  of  the  whole  team,  rather 
than  agent  i  is  removed,  leading  to: 

$$D^i_{Team} = \begin{cases} G(z) - G(z - z_{T_i}^a), & \text{for } T_i \to a, \\ G(z - z_i^a + z_i^b) - G(z - z_{T_i}^b - z_i^a), & \text{for } i \to b \in T_i, \\ 0, & \text{otherwise}, \end{cases} \tag{9}$$

where $z_{T_i}^a$ is the state component of team members of agent i taking action a. In
this formulation, the impact of all of agent i’s teammates is removed before the
reward is calculated. Note in this case, unlike in Eqs. (7) and (8), the second term
is  different  for  the  two  actions.  This  is  because  this  term  estimates  the  impact  of 
removing  all  team  members  of  i  that  had  taken  a  particular  action.  When  agent  i 
changes  its  action,  this  also  changes  the  team  members  taking  the  same  action  as  i. 
For  the  action  a  selected  by  agent  i,  we  only  need  to  remove  all  its  team  members 
who  took  that  action.  But  to  find  the  counterfactual  reward  for  action  6,  we  need 
to  remove  the  actual  action  of  agent  i  (action  a)  and  then  remove  the  team  mem¬ 
bers  who  had  taken  action  b.  Though  conceptually  similar  to  previous  rewards,  the 
presence  of  team  members  leads  to  this  subtle  difference  in  the  computation  of  the 
team  ANT  reward. 

Now, let us explicitly compute D_i^{Team} for the congestion problem considered in
this paper. First, for the action taken by agent i [first line of Eq. (9)], the reward
becomes:

\[
D_{Team}^{z_i^a} = G(z) - G(z - z^a_{T_i})
\tag{10}
\]

where z - z^a_{T_i} is the state component in which agent i and its teammates taking action
a have no effect; day_i is the day agent i selects to attend; x_{day_i} is the attendance on
the day agent i selects to attend; and |T_i^a| is the number of agent i's teammates
that choose day_i to attend.

Second,  let  us  focus  on  the  actions  not  taken  by  agent  i  [second  line  of  Eq.  (9)]. 
This  is  the  reward  agent  i  would  have  received  had  it  taken  the  actions  b  chosen  by 
some  of  its  teammates,  leading  to: 


\[
D_{Team}^{z_i^b} = G(z - z_i^a + z_i^b) - G(z - z_i^a - z^b_{T_i})
\tag{11}
\]

where z - z_i^a + z_i^b is the state component in which agent i takes action b rather than
action a; z - z_i^a - z^b_{T_i} is the state component in which agent i (taking action a) and
its teammates taking action b are removed from the state; x_{day^b} is the attendance
resulting from agent i taking action b; and |T_i^b| is the number of agent i's teammates
that choose to attend on the day resulting from action b.

In  this  formulation,  if  the  agent  i’s  team  members  have  taken  all  the  possible 
actions,  each  action  that  agent  i  had  not  taken  will  still  be  updated.  Otherwise, 
only  actions  taken  by  i’s  teammates  will  be  available  for  reward  information  and 
therefore  updated. 
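The structure of Eq. (9) can be illustrated with a short Python sketch. It rests on assumptions that are not stated in this section: the system reward is taken to be the congestion form G = sum over days of x * exp(-x / C), and, following the text after Eq. (10), removing the team from the taken day removes agent i together with its teammates on that day. The function and argument names are hypothetical.

```python
import math
from collections import Counter


def system_reward(attendance, capacity):
    """Assumed congestion reward: sum over days of x * exp(-x / C)."""
    return sum(x * math.exp(-x / capacity) for x in attendance)


def team_rewards(attendance, capacity, own_day, teammate_days):
    """Return {day: reward} for agent i, following the three cases of Eq. (9).

    attendance    -- attendance per day, including agent i and its teammates
    own_day       -- index of the day (action a) chosen by agent i
    teammate_days -- days chosen by agent i's teammates (agent i excluded)
    """
    g = system_reward(attendance, capacity)
    team_count = Counter(teammate_days)
    rewards = {}

    # Action taken: remove agent i and every teammate that chose the same day.
    removed = list(attendance)
    removed[own_day] -= 1 + team_count[own_day]
    rewards[own_day] = g - system_reward(removed, capacity)

    # Actions not taken: only days chosen by at least one teammate are rewarded.
    for day, count in team_count.items():
        if day == own_day:
            continue
        moved = list(attendance)       # z - z_i^a + z_i^b
        moved[own_day] -= 1
        moved[day] += 1
        stripped = list(attendance)    # z - z_i^a - z^b_{T_i}
        stripped[own_day] -= 1
        stripped[day] -= count
        rewards[day] = (system_reward(moved, capacity)
                        - system_reward(stripped, capacity))

    # Days that no teammate chose get no entry (the "0, otherwise" case).
    return rewards
```

Days absent from the returned dictionary correspond to actions for which no reward information is available, so their values would be left unchanged by the learner.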

Figure 5 shows the learning curves for D, D_{ANT-ES} and D_{Team}. Agents using
D_{Team} not only learn faster, but also reach higher system rewards than agents using
the baseline D or previous variants of D_{ANT}. In this instance, not only was information
from team members used, but the reward of each team member was also the
same, resulting in a larger "block" of agents receiving a reward, and removing a
significant amount of noise from the rewards.


4.4.  ANT  reward  with  weighted  teams 

The use of team rewards provided tangible benefits, though it treated all information
received from team members equally. Yet, one can consider that the more team
members take a particular action, the more reliable the estimate for the reward of
that action would become. This becomes particularly relevant when the congestion
in the system increases.

Fig. 5. System performance versus training weeks. D performs well, but D_{Team}, based on updating
only actions that were taken by team members, both learns faster and reaches higher system
rewards than D or D_{ANT-ES}.

A simple solution to this problem is to use a weighting factor for the second term
of the counterfactual reward function. In this work, we use the average number of
team members selecting particular actions, though more sophisticated methods can
also be used. Modifying Eq. (9) in this way gives, for agent i and action b,
a weighted team reward D_{WT}:

\[
D_{WT}^{b} = G(z - z_i^a + z_i^b) - \overline{|T_i^b|} \cdot G(z - z^b_{T_i} - z_i^a),
\tag{12}
\]

where \overline{|T_i^b|} is the average number of team members taking action b.
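As a rough illustration of Eq. (12), the sketch below applies a weight to the second term of the counterfactual reward for an action not taken. The congestion form of G and all names are assumptions made for this example, and how the average team count is maintained is left to the caller because the text does not specify it.

```python
import math


def system_reward(attendance, capacity):
    """Assumed congestion reward: sum over days of x * exp(-x / C)."""
    return sum(x * math.exp(-x / capacity) for x in attendance)


def weighted_team_reward(attendance, capacity, own_day, other_day,
                         team_on_other, avg_team_on_other):
    """Weighted counterfactual reward for moving agent i to other_day.

    team_on_other     -- teammates of agent i currently on other_day
    avg_team_on_other -- weighting factor (average team count, supplied by caller)
    """
    moved = list(attendance)        # z - z_i^a + z_i^b
    moved[own_day] -= 1
    moved[other_day] += 1
    stripped = list(attendance)     # z - z^b_{T_i} - z_i^a
    stripped[own_day] -= 1
    stripped[other_day] -= team_on_other
    return (system_reward(moved, capacity)
            - avg_team_on_other * system_reward(stripped, capacity))
```

With a weight of 1.0 the function reduces to the unweighted second line of Eq. (9).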

Figure 6 explores this idea for 460 agents in a system with seven actions and a
capacity of 4. Because the optimal capacity in this case is 7 × 4 = 28, this creates
significant congestion. The results show that traditional D starts to suffer in this
case, and that the weighted D_{WT} outperforms D_{Team}. Figure 7 shows the impact
of congestion directly as the number of agents in the system increases from 120 to
460. D_{WT} handles the increased congestion better than either D_{Team} or D.

5.  Tracking  Dynamic  Environments 

One of the key advantages to learning rapidly is the ability to adapt to dynamic
environments where the conditions may change faster than a traditional learner can
adapt. In this section, we test the performance of the ANT rewards with weighted
team reward on two types of dynamic environments. First, we explore seemingly
random changes in agent numbers and capacities, and then we explore faster, but
periodic changes of both types.

Fig. 6. System performance for the weighted team rewards. There are 460 agents in the system
with only seven actions of capacity 4, leading to significant congestion. The performance of D_{WT}
is significantly higher than either the base D or D_{Team}.

Fig. 7. The impact of congestion on system performance for the weighted team rewards. The
number of agents increases, but the capacity of each day stays the same (C = 4). The performances
of both D_{Team} and D_{WT} are significantly higher than D, and D_{WT} handles the congestion the
best.

5.1.  Unpredictable  changes  to  the  environment 

In this section, we explore the ability of D_{WT} to adjust to unexpected changes in
the system. Figure 8 shows the system response to changes in the number of agents.
In this case, the number of agents changed every 40 weeks from 280, to 140, to 180,
to 100. D_{WT} not only recovers rapidly, but also learns to exploit the new condition,
as demonstrated at week 120: after the initial drop caused by the change, agents
using D return to their previous state, but agents using D_{WT} reach a higher system
reward value.

Figure 9 shows the system response to the capacity changing from 3 to 7 every
70 weeks. D_{WT} learns faster early on and reaches slightly higher performance, but
this experiment shows that D can track slow changes in the environment.

5.2.  Periodic  changes  to  the  environment 

In this section, we explore periodic and rapid changes to the environment. Figure 10
explores the performance of D_{WT} versus the difference reward D when the number of
agents is changing rapidly. Unlike in the results of the previous section (Fig. 8),
D has a hard time tracking these changes. D_{WT}, on the other hand, converges to
a good solution despite the number of agents in the system changing the optimal
solutions for each agent. (For this experiment, we modified the value update function
to account for the periodicity of the system, and allowed the value update to
be: V_t = (1 - α) · (τ · V_{t-1} + (1 - τ) · V_{t'}) + α · R, where t' corresponds to the last
time in which the capacity was the same. This value can be estimated in practice,
though in this instance, in order to remove the impact of such estimation on the
reward analysis, we provided it to both reward functions.)

Fig. 8. System performance when the number of agents in the system changed from 280, 140,
180, 100 each 40 time steps, for seven actions and a capacity of 4. D_{WT} outperforms D both in
response time and final solution quality.

Fig. 9. System performance when the capacity of the system changes from 3 to 7 and back every
70 time steps for four actions and 120 agents.
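The periodic value update described above can be sketched as a single function. This is a minimal illustration under the assumption that the update blends the most recent value with the value stored at the last time step t' that had the same conditions; the argument names are hypothetical and the indices reflect one reading of the printed formula.

```python
def periodic_value_update(v_prev, v_same_phase, reward, alpha, tau):
    """V_t = (1 - alpha) * (tau * V_{t-1} + (1 - tau) * V_{t'}) + alpha * R.

    v_prev       -- value estimate from the previous step, V_{t-1}
    v_same_phase -- value estimate at t', the last time the conditions matched
    reward       -- reward R received for the action at time t
    alpha        -- learning rate
    tau          -- weight given to the most recent estimate
    """
    blended = tau * v_prev + (1.0 - tau) * v_same_phase
    return (1.0 - alpha) * blended + alpha * reward
```

Setting tau to 1 recovers the usual update that ignores periodicity, which makes the role of the extra term easy to see.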

Finally, we explore the impact of rapid changes to the system capacity. Figure 11
shows the system performance when the capacity oscillates between 2 and 5. Unlike
in Fig. 9, D cannot track this continuous change as it does not get sufficient time
to learn the system before the environment changes. D_{WT}, however, tracks the
changes. Even though it has difficulties with the rapid changes, it reaches
higher system-level performance for both C = 2 and C = 5.

Fig. 10. System performance versus training weeks. There were eight actions with a capacity of
5. The standard difference reward D is plotted against D_{WT} with variations of 60-120 in the number of
agents. D cannot converge to a good solution, but D_{WT} not only converges to a good solution
but does so rapidly after each capacity change.

Fig. 11. System performance when the number of agents changes periodically. There were eight
actions and 120 agents and capacity changed from 2 to 5 every 50 weeks. D performs poorly, but
D_{WT} learns faster and reaches higher system rewards than D for both capacities.

6.  Discussion 

In  large  multiagent  systems,  the  agents  face  a  difficult  learning  problem  where 
their  actions  are  filtered  through  the  “group  action”  before  leading  to  a  reward. 
As  a  consequence,  an  agent  has  a  lengthy  learning  period  where  the  actions  need 
to  be  sampled  a  large  number  of  times  to  extract  the  “signal”  from  the  “noise.” 
The  use  of  the  difference  reward  provides  an  improvement  over  directly  using  the 
system  reward.  However,  a  standard  difference  reward  function  still  relies  on  each 
action being sampled before the cleaned-up reward can be obtained. In this work,
we present a modification of the previously used difference reward, called the ANT reward,
that provides agents with rewards for actions that were not taken by the agent.

We then provide modified versions of the ANT reward that, through early stopping
and team structures, provide improvements in both the learning speed and
the quality of the solution reached. The increase in speed of learning is the direct
result of an agent receiving a counterfactual reward that estimates the reward that
agent would have received had it taken a particular action. Furthermore, we show
that the performance improvements are significantly more pronounced in dynamic
environments where the conditions change either randomly or with high periodicity.
In both cases, the rapid learning allows the agents to track a highly dynamic
environment.

Though these results are encouraging, there are multiple areas for further investigation
in this domain. First, the communication and observation requirements of
the agents can be explicitly explored and connected to the system performance.
Second, having agents adopt particular roles within a team can potentially provide
further improvements in the learning speed. Finally, modifying the way in which
agents estimate their ANT rewards can lead to substantial computational gains in
addition to the already achieved speedup in the number of iterations required for
convergence. We are currently investigating all three extensions of this work.

In large, distributed systems composed of adaptive and interactive components (agents),
ensuring the coordination among the agents so that the system achieves certain performance
objectives is a challenging proposition. The key difficulty to overcome in such
systems is one of credit assignment: how to apportion credit (or blame) to a particular
agent based on the performance of the entire system. In this paper, we show how
this problem can be solved in general for a large class of reward functions whose analytical
form may be unknown (hence "black box" reward). This method combines the
salient features of global solutions (e.g. "team games"), which are broadly applicable but
provide poor solutions in large problems, with those of local solutions (e.g. "difference
rewards"), which learn quickly but can be computationally burdensome. We introduce
two estimates for local rewards for a class of problems where the mapping from the
agent actions to system reward functions can be decomposed into a linear combination
of nonlinear functions of the agents' actions. We test our method's performance on a
distributed marketing problem and an air traffic flow management problem and show a
44% performance improvement over team games and a speedup of order n for difference
rewards (for an n agent system).

Keywords: Multiagent learning; black box reward functions; multiagent coordination.


1.  Introduction 

The ability of a team of agents to learn distributed policies has been demonstrated
successfully in numerous domains such as controlling multiple robots, aggregating
information from distributed data sources, and distributed system administration
[11, 14, 17, 27, 30]. While diverse, each of these domains shares two important
properties fundamental to interesting distributed learning problems: (1) each agent
learns its own set of actions (policy), and (2) each policy is trying to maximize a system
reward that is a nonlinear function of all the policies, thus coupling the policies
together. This type of problem is best described as a multiagent learning problem,
where each agent, i, takes an action z_i and tries to maximize a reward function,
G(z), that is a function of z, the actions of all the agents [35, 33, 39, 36].

When  the  agent  actions  need  to  be  coordinated,  this  issue  becomes  particularly 
challenging  due  to  the  structural  credit  assignment  problem  [1,  22,  37,  38].  In  this 
problem,  credit  must  be  assigned  to  a  particular  agent  based  on  the  performance 
of  the  full  system.  For  example,  when  an  agent  takes  an  action  and  G  improves, 
the  agent  needs  to  determine  whether  its  action  was  (partly)  responsible  for  that 
improvement. Though lengthy learning trials can statistically eliminate the impact
of  other  agents  on  G,  such  an  approach  is  not  practical  for  large  systems.  If  G  is 
linearly  separable  in  the  agents’  actions,  this  credit  assignment  problem  is  trivial  as 
each  agent  can  maximize  its  own  separate  component  of  that  reward.  In  contrast, 
if  G  depends  on  all  the  agents’  actions  directly,  such  as  the  parity  problem,  finding 
an  adequate  distributed  solution  is  nearly  impossible,  and  the  problem  needs  to 
be  reformulated.  In  this  paper,  we  focus  on  problems  where  moderate  numbers 
of  agents  need  to  coordinate  their  actions  with  one  another  to  reach  satisfactory 
values  of  G. 

For  systems  with  few  agents,  this  credit  assignment  problem  can  be  sidestepped, 
and all agents can use G directly. However, when the number of agents in a system
increases, this method breaks down and agents need to receive a reward that accounts
for  their  contribution  to  the  system.  The  “difference  reward”  provides  such  a  reward, 
and  has  produced  good  results  in  many  domains  [5,  31,  33,  34].  However,  as  currently 
expressed,  the  difference  reward  requires  knowledge  of  the  functional  form  of  the 
system  reward. 

In this paper, we present an approach that lifts this requirement using two
estimates of the difference reward that retain its fast learning characteristics, but
do not require full knowledge of the functional form of G(z). In Sec. 2, we briefly
describe the related work. Section 3 describes the system reward structure and the
basic difference reward used in multiagent learning. Section 4 derives two estimates for
the difference reward that allow its application to domains with G of unknown
functional form. Section 5 presents experimental results on both a distributed marketing
problem and a complex air traffic flow problem. Section 6 discusses the
mathematical implications and the future applications of the estimated difference
rewards.


2.  Related  Work 

In general, work in multiagent learning can be grouped into one of two broad categories:
(i) work leveraging domain knowledge; and (ii) general work applicable to
a subset of the domains. Some of the most successful work in multiagent learning
falls into the first category. In robotic soccer, for example, player-specific subtasks
followed by tiling provide good convergence properties [27]. In foraging robot coordination,
specific rules induce good division of labor [19]. In a distributed air traffic
control domain, a combination of positive rewards and penalty rewards allows a
collection of aircraft to navigate safely [15]. In all cases, the agent coordination is
achieved through exploiting knowledge of the system dynamics and accentuating
the known desirable interactions among the agents.

The second set of approaches provides general solutions to a subset of the problems.
Early work on this topic focused on "team games" where each agent considers
itself the only agent in the system and receives the full system reward. An example
of this approach is the control of four elevators where a separate reinforcement
learner was used to control each elevator, and each learner received the full system
reward [10]. While such a "team game" approach is effective, it is restricted to
domains with a small number of agents. In problems where groups of agents can
be assumed to be independent, the task can be decomposed by learning a set of
basis functions used to represent the value function, where each basis only processes
a small number of the state variables [14]. Task decomposition has also been
used in single agent RL using hierarchical reinforcement learning methods such as
MAXQ value function decomposition [12]. In multiagent learning, Partially Observable
Markov Decision Processes (POMDPs) can be simplified through piecewise
linear rewards [24]. In other cases, agents can be assumed to be locally connected
through a graph and can learn efficiently through local rewards [6]. Outside of reinforcement
learning, mechanism design has been used with MDPs to address the
issue of creating good agent incentives for specific types of rewards [25].

3.  Agent  and  System  Rewards 

As stated in the introduction, in this paper we present a method to estimate difference
rewards that does not require full knowledge of the functional form of the
system reward G.

3.1.  System  reward 

In particular, we focus our study on the class of problems where the system reward is
of the form:

\[
G(z) = G_f(f(z)) = G_f\Big(\sum_i f_i(z_i)\Big),
\tag{1}
\]

where G_f is a known nonlinear function and the f_i's are unknown nonlinear functions.
Table 1 summarizes the functional form of G and its arguments.


Table 1. Functional forms for system objective function.

Function    Form                  Argument
G           Unknown, nonlinear    z
G_f         Known, nonlinear      f
f           Known, linear         f_i
f_i         Unknown, nonlinear    z_i

The key assumption in this work is that the f_i cannot be sampled from the
domain, but that f can be sampled (potentially at a high cost). This form
of G(z) applies to a large number of domains where agents have an unknown
effect on their environment (f_i) and these effects are aggregated together. Such
domains include air (or highway) traffic flow management, distributed gating, and
distributed information gathering. While the agents do not know the f_i's, they do
know how these aggregated effects contribute to the system goal in the form of
G_f. Our estimate exploits this structure of G to create local rewards that allow
learning to proceed significantly faster than directly using G and that can be applied to
systems where the agent-specific rewards cannot be applied because the form of G is
unknown.
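The structure in Eq. (1) and Table 1 can be made concrete with a small Python sketch. The helper name and the toy component functions are hypothetical; the point is only that a learner can call the aggregate f(z) and the known G_f, while the individual f_i stay hidden inside the black box.

```python
def make_system(f_components, G_f):
    """Build a black-box sampler for f(z) and the system reward G(z).

    f_components -- hidden per-agent functions f_i (not visible to learners)
    G_f          -- known nonlinear function of the aggregate f
    """
    def sample_f(z):
        # Only this aggregate is observable: f(z) = sum_i f_i(z_i).
        return sum(f_i(z_i) for f_i, z_i in zip(f_components, z))

    def G(z):
        return G_f(sample_f(z))

    return sample_f, G


# Toy usage: two agents with different hidden effects, target aggregate of 5.
sample_f, G = make_system(
    [lambda a: a ** 2, lambda a: 2 * a],
    lambda f: -abs(f - 5),
)
```

In this toy setting, the joint action z = [1, 2] gives an aggregate of 5 and therefore the best possible system reward of 0.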


3.2.  Difference  reward 

In a multiagent setting, while each agent can try to maximize the system reward
directly, such an approach leads to slow/poor learning due to the structural credit
assignment problem. An alternative is to have each agent attempt to maximize
an agent-specific reward function derived in such a way that if agents succeed
in maximizing that reward function, they collectively also maximize G. One such
reward function is the difference reward function of the form [33]:

\[
D_i = G(z) - G(z - z_i + c_i),
\tag{2}
\]

where z_i is the action of agent i, and c_i is an arbitrary "action" that does not
depend on agent i's actions.^a In the second term of D_i, z - z_i + c_i represents the
"counterfactual" state where the action of agent i, z_i, is replaced by a fixed action
c_i that is independent of the agent's action.

There are two advantages in using D. First, the second term, G(z - z_i + c_i),
differs from the first term, G(z), only in the actions of agent i. If agent i's action
is not tightly coupled to the actions of the other agents, then the second term will
subtract out much of the impact of the actions of the other agents in the system,
therefore providing an agent with a "cleaner" signal than G. For instance, if all
the other agents choose poor actions, the impact of these actions would appear
in both terms of D_i, and would mostly cancel out. This benefit has been dubbed
"learnability" (agents have an easier time learning) in previous work [33]. Second,
because the second term does not depend on the actions of agent i, any action
taken by agent i that improves D_i also improves G. Therefore, we expect policies
that maximize D will also maximize G. This specific form of difference reward has
been effective in a number of domains including congestion problems, multi-rover
policy evolution, and bin-packing [2, 30, 33].
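For a system reward whose form is known, Eq. (2) amounts to evaluating G twice. The sketch below is a minimal illustration, assuming the joint action is a list with one entry per agent and that G is any callable on such a list; the names are hypothetical.

```python
def difference_reward(G, z, i, c_i):
    """D_i = G(z) - G(z - z_i + c_i) for a known system reward G.

    G   -- callable mapping a joint action (list) to a scalar reward
    z   -- joint action, one entry per agent
    i   -- index of the agent being rewarded
    c_i -- fixed counterfactual action that replaces agent i's action
    """
    counterfactual = list(z)
    counterfactual[i] = c_i          # z - z_i + c_i
    return G(z) - G(counterfactual)


# Toy system reward: best when exactly three agents choose action 1.
def toy_G(z):
    return -(sum(z) - 3) ** 2
```

With four agents all choosing action 1, replacing one agent's action by 0 improves the toy reward, so that agent's difference reward is negative, signalling that its action hurts the system.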

^a This notation uses zero padding and vector addition rather than concatenation to form full state
vectors from partial state vectors.

As an example, consider the application of this reward to a multi-robot coordination
problem where multiple robots need to gather importance-weighted information
and maximize the total information collected by all the robots [5, 30]. In such a
case, by selecting a c_i that removes a robot's observations from the system, the difference
reward measures the contribution of that robot to the system. Note that this
is not equivalent to having each robot simply maximize the information it collects
(which leads to poor system behavior) [5]. Instead, the difference reward leads to
robots exploring areas that would not have been explored by other robots. That is,
if a second robot would have observed a particular area, then the difference reward
provides low values, urging the robot to find information with more value to the
full system [5].

4.  Estimates  of  Difference  Rewards 

Though providing a good compromise between aiming for system performance and
removing the impact of other agents from an agent's reward, one issue that may
plague D is computational cost. Because it relies on the computation of the counterfactual
term G(z - z_i + c_i) (i.e. the system performance without agent i), it may
be difficult or impossible to compute, particularly when the exact mathematical
form of G is not known.

For  reward  functions  that  are  of  the  form  given  in  Eq.  (1)  and  summarized  in 
Table  1,  however,  we  can  derive  estimates  for  D  that  overcome  this  limitation.  Our 
premise  is  that  we  can  sample  values  from  f(z),  enabling  us  to  compute  G,  but  that 
we  cannot  sample  from  each  fi(zi).  In  addition,  we  assume  we  may  not  be  able  to 
even  compute  f(z)  directly  and  must  sample  it  from  a  “black  box”  computation 
(e.g.  a  system  simulator)  or  measure  it  from  the  environment. 

4.1.  First  estimate 

The key element in the computation of the difference reward is the counterfactual
G(z - z_i + c_i):

\[
G(z - z_i + c_i) = G_f(f(z - z_i + c_i)) = G_f(f(z) - f_i(z_i) + f_i(c_i)).
\tag{3}
\]


Unfortunately, we cannot compute this directly as the values of f_i(z_i) are unknown.
However, if agents take actions independently (i.e. they do not observe how other
agents act before taking their own actions) we can take advantage of the linear form
of f(z) in the f_i's with the following equality:

\[
E(f_{-i}(z_{-i}) \mid z_i) = E(f_{-i}(z_{-i}) \mid c_i),
\tag{4}
\]

where E(f_{-i}(z_{-i}) | z_i) is the expected value of f_{j≠i} (all f's other than f_i) given the
value of z_i, and E(f_{-i}(z_{-i}) | c_i) is the expected value of f_{j≠i} given c_i. We then get
the following estimate for f(z - z_i + c_i):

\[
\begin{aligned}
f(z - z_i + c_i) &= f(z) - f_i(z_i) + f_i(c_i) \\
&= f(z) - f_i(z_i) - E(f_{-i}(z_{-i}) \mid z_i) + f_i(c_i) + E(f_{-i}(z_{-i}) \mid c_i) \\
&= f(z) - E(f_i(z_i) \mid z_i) - E(f_{-i}(z_{-i}) \mid z_i) + E(f_i(c_i) \mid c_i) + E(f_{-i}(z_{-i}) \mid c_i) \\
&= f(z) - E(f(z) \mid z_i) + E(f(z) \mid c_i).
\end{aligned}
\tag{5}
\]

Therefore, we can evaluate D_i = G(z) - G(z - z_i + c_i) as:

\[
D_i^{est1} = G_f(f(z)) - G_f\big(f(z) - E(f(z) \mid z_i) + E(f(z) \mid c_i)\big).
\]

The first term of D_i^{est1} is the same as the original difference reward. The second
term of D_i^{est1} tries to remove the impact of the other agents, but cannot do this
as elegantly as the difference reward since the form of function f(z) is not known.
Instead of subtracting out f_i(z_i) and adding f_i(c_i) directly, we estimate this by
taking the difference between the average impact of action z_i on f(z) and the average
impact of action c_i on f(z). This leaves us with the task of estimating the values of
E(f(z)|z_i) and E(f(z)|c_i). These estimates can be computed by keeping a table of
averages where we average the values of the observed f(z) for each value of z_i that
we have seen. Note that this estimate improves as the number of samples increases.
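The table of averages described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that f(z) is observed as a scalar, that actions are hashable, and that both z_i and c_i have been seen at least once; the class and function names are hypothetical.

```python
class ConditionalMean:
    """Running average of observed f(z), keyed by the agent's own action."""

    def __init__(self):
        self._sum = {}
        self._count = {}

    def update(self, action, f_value):
        """Record one observation of f(z) made while the agent took `action`."""
        self._sum[action] = self._sum.get(action, 0.0) + f_value
        self._count[action] = self._count.get(action, 0) + 1

    def mean(self, action):
        """Estimate of E(f(z) | action); raises KeyError if never observed."""
        return self._sum[action] / self._count[action]


def d_est1(G_f, f_value, z_i, c_i, table):
    """First estimate: G_f(f) - G_f(f - E(f | z_i) + E(f | c_i))."""
    shifted = f_value - table.mean(z_i) + table.mean(c_i)
    return G_f(f_value) - G_f(shifted)
```

Each agent would keep its own table and call `update` after every sampled f(z), so the estimate sharpens as more samples arrive.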


4.2.  Second  estimate 

The discussion above is generally applicable to any selection of c_i. We can improve
this estimate if we set c_i = E(z_i) and make the mean squared approximation of
f_i(E(z_i)) ≈ E(f_i(z_i)). The last expectation in D_i^{est1} is transformed as follows:

\[
\begin{aligned}
E(f(z) \mid c_i) &= E\Big(\sum_j f_j(z_j) \,\Big|\, z_i = E(z_i)\Big) \\
&= E\big(f_i(z_i) \mid z_i = E(z_i)\big) + \sum_{j \neq i} E(f_j(z_j)) \\
&= E\big(f_i(E(z_i))\big) + \sum_{j \neq i} E(f_j(z_j)) \\
&\approx E\big(E(f_i(z_i))\big) + \sum_{j \neq i} E(f_j(z_j)) \\
&= E(f(z)).
\end{aligned}
\tag{6}
\]

We can then estimate G(z) - G(z - z_i + c_i) as:

\[
D_i^{est2} = G_f(f(z)) - G_f\big(f(z) - E(f(z) \mid z_i) + E(f(z))\big).
\]

The estimate D_i^{est2} is the same as D_i^{est1}, except that E(f(z)|c_i) has been replaced
with E(f(z)). This formulation has two advantages over D_i^{est1}. First, there are more
samples at our disposal to estimate E(f(z)) than there are to estimate E(f(z)|c_i).
Second, this removes the need to select a value for c_i. Since selecting a value of
c_i that will lead to high performance can be difficult in some domains, it can be
advantageous to have this parameter removed.


5.  Experimental  Results 

To  test  the  effectiveness  of  the  difference  reward  and  its  estimates,  we  conduct  a 
series  of  experiments  in  two  domains.  The  first  domain  is  an  illustrative  example  in 
the  form  of  a  distributed  marketing  problem,  where  separate  marketing  agents  try 
to  market  a  common  resource  to  distinct  groups  of  potential  customers.  The  second 
domain  tests  the  performance  of  our  reward  system  in  a  complex  air  traffic  flow 
domain,  where  we  use  the  Future  ATM  Concepts  Evaluation  Tool  (FACET)  air 
traffic  simulator  to  test  the  ability  of  learning  agents  to  create  policies  that  reduce 
congestion while minimizing delays [8]. In all experiments, we test the performance
of five different methods. The first method is Monte Carlo (MC) estimation, where
random policies are created, with the best policy being chosen. The other four methods
are based on reinforcement learning agents where the agents are maximizing
one of the following rewards:

(1) the system reward, G(z);

(2) the actual difference reward, D_i(z);

(3) the first difference reward estimate, D_i^{est1}(z); and

(4) the second difference reward estimate, D_i^{est2}(z).

In  these  experiments,  the  aim  of  each  agent  is  to  learn  to  take  actions  that 
will  lead  to  the  best  system  performance,  G.  To  form  policies,  each  agent  uses  an 
agent-specific  reward  function  and  tries  to  maximize  it  with  its  own  reinforcement 
learner  [20,  40]  (though  alternatives  such  as  evolving  neuro-controllers  are  also 
effective  [28,  30]).  To  clearly  illustrate  the  benefit  of  the  reward  estimates,  in  this 
paper  we  focus  on  domains  that  only  need  to  utilize  immediate  rewards.  As  a 
consequence,  simple  table-based  immediate  reward  reinforcement  learning  is  used. 
The reinforcement learner is equivalent to an $\epsilon$-greedy Q-learner with a discount
rate of 0 [20]. In all the experiments, the learning rate is equal to 0.5 and $\epsilon$ is equal
to 0.25. Note that in many domains, reinforcement learning needs to look at rewards
beyond the immediate reward and address a temporal credit assignment problem of
how to reward a current action for a sequence of future rewards. Difference rewards
have been shown to address both the structural and temporal credit assignment
problems for domains where the functional form of G is known [4].
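The learner described above can be sketched as follows. This is a minimal illustration, assuming one value per action and no state; the class and method names are hypothetical, and only the learning rate (0.5), the exploration rate (0.25) and the zero discount rate come from the text.

```python
import random


class ImmediateRewardLearner:
    """Table-based learner; with a discount rate of 0, Q-learning reduces
    to keeping a running value estimate for each action."""

    def __init__(self, n_actions, learning_rate=0.5, epsilon=0.25, rng=None):
        self.values = [0.0] * n_actions      # one table entry per action
        self.learning_rate = learning_rate   # 0.5 in the experiments
        self.epsilon = epsilon               # 0.25 in the experiments
        self.rng = rng or random.Random()

    def choose_action(self):
        # Explore with probability epsilon, otherwise take the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, action, reward):
        # The target is the immediate reward only (no bootstrapped future term).
        old = self.values[action]
        self.values[action] = old + self.learning_rate * (reward - old)
```

Each agent would hold its own learner and pass in whichever reward it is maximizing (G, D, or one of the estimates) after every episode.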

To  make  the  agent  results  comparable  to  the  MC  estimation,  the  best  policies 
chosen  by  the  agents  over  a  single  trial  are  used  in  the  results.  MC  and  similar 
random  approaches  are  common  in  complex  air  traffic  problems  [21].  All  results  are 
an average of 30 independent trials with the differences in the mean ($\sigma/\sqrt{n}$) shown
as  error  bars,  though  in  most  cases  the  error  bars  are  too  small  to  see. 
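For reference, the error-bar quantity can be computed as in the short sketch below. It assumes that σ is the sample standard deviation across trials, which the text does not specify; the function name is hypothetical.

```python
import math
import statistics


def standard_error(samples):
    # sigma / sqrt(n), with sigma taken as the sample standard deviation
    # (an assumption; the population form would use statistics.pstdev).
    return statistics.stdev(samples) / math.sqrt(len(samples))
```

With 30 independent trials per data point, n = 30 in this expression.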

5.1.  Distributed  marketing  problem 

The  first  domain  we  study  is  a  distributed  marketing  problem  where  a  number  of 
agents  need  to  choose  a  strategy,  and  their  reward  depends  on  the  strategies  of  all 
the  agents.  This  is  a  form  of  congestion  game  where  particular  joint  actions  lead  to 
desirable  or  undesirable  behavior  based  on  the  number  of  other  agents  that  have 
selected  that  particular  action  [4,  9,  16,  18,  32,  41]. 

5.1.1.  Problem  description 

In  the  “Marketing  Problem”  there  are  n  agents  marketing  a  constrained  resource 
to  n  different  demographics.  Examples  include  marketing  a  resort  hotel  to  several 
different  parts  of  the  country,  or  a  public  transportation  system  to  different  cities 
in  a  metropolitan  area.  In  this  problem,  each  agent  has  a  finite  set  of  marketing 
strategies. The action of agent i is to choose a strategy $z_i$. The number of people
who use the resource is an unknown nonlinear function of the marketing of all the
agents, f(z). After each training episode, the value of f(z) is measured. We assume
that the marketers are targeting disjoint groups so f(z) has the following form:
$f(z) = \sum_i f_i(z_i)$, where $f_i(z_i)$ is a nonlinear function of agent i's marketing action.
The function f(z) represents the aggregate sum of the effects of all the agents'
marketing actions.

This  form  represents  situations  where  marketers  do  not  have  a  model  for  the 
effects  of  their  marketing,  so  the  function  fi(zi)  is  unknown.  In  addition,  the  value  of 
fi(zi)  is  never  measured  as  we  only  have  measurements  of  the  aggregate  number 
of  people  using  the  resource,  f(z).  The  system  goal  is  to  have  the  optimal  amount 
of  people  use  the  resource.  More  revenue  is  gained  by  having  more  people  use  it, 
but we do not want it to become overused as that will hurt our reputation (e.g.
overuse in a transportation domain would result in congestion). The system goal is
represented by a known nonlinear function of the total number of people using the
resource: $G(z) = G_f(f(z))$. Note that while the functional form of $G_f$ is known,
the form of G (in terms of z) is not, since f(z) is unknown.

5.1.2.  Results 

We conducted a series of experiments where agents choose one of M marketing
actions, where each action deterministically results in zero to C customers using
the resource, depending on the action. For each agent, the mapping from action to
customer response $f_i(z_i)$ was chosen at random at the start of the experiment. The
function $G_f$ was set to $k e^{-k/c}$, where $k = f(z)$ is the aggregate number of customers
that use the resource and c is the optimal capacity for the resource.

Fig. 1. Marketing experiment with 100 agents. The estimates for D perform better than G,
though not as well as the full D, which is not “computable” in many real world domains.

Figure 1 shows the performance of the five different methods in a marketing
problem with 100 agents, where M = 10 and C = 18. MC optimization provides a
baseline solution. Agents using G directly as their reward perform slightly better
than MC. But in this case, each agent's reward is affected by the actions of the
99 other agents, making it hard for an agent to discern the effects of its
action on its reward. In contrast, agents using the true difference reward learn fast
and learn well. However, this reward is not directly computable when agents do not
know the functional form of f(z) needed to compute the counterfactual
$G_f(f(z - z_i + c_i))$. Therefore, this is a theoretical result that cannot be implemented in many
real domains. The results for the estimates to the difference rewards (described in
Sec. 4), where an agent only needs to know the value of f(z), are promising as they
outperform agents using G.
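To make the counterfactual concrete, the sketch below builds a small instance of the marketing domain and computes the true difference reward for the case where the per-agent responses are known. It is an illustration only: the function names are hypothetical, the capacity value is left to the caller because the text does not state it, and using action 0 as the fixed counterfactual action $c_i$ is an assumption.

```python
import math
import random


def make_response_tables(n_agents, n_actions, max_customers, rng):
    # f_i: each action deterministically yields 0..max_customers customers,
    # drawn once at the start of the experiment.
    return [[rng.randint(0, max_customers) for _ in range(n_actions)]
            for _ in range(n_agents)]


def aggregate_customers(tables, z):
    # f(z) = sum_i f_i(z_i): the only quantity that is actually measured.
    return sum(tables[i][action] for i, action in enumerate(z))


def system_reward(k, capacity):
    # G_f(k) = k * exp(-k / capacity), which peaks at k == capacity.
    return k * math.exp(-k / capacity)


def difference_reward(tables, z, i, capacity, counterfactual_action=0):
    # D_i(z) = G_f(f(z)) - G_f(f(z - z_i + c_i)); needs f_i, so it is only
    # computable when the per-agent response is known.
    k = aggregate_customers(tables, z)
    k_cf = k - tables[i][z[i]] + tables[i][counterfactual_action]
    return system_reward(k, capacity) - system_reward(k_cf, capacity)
```

Because f is a sum over agents, the counterfactual only swaps agent i's own term, which is why D gives each agent a much cleaner signal than G when the other 99 agents are also changing their actions.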

One interesting question that arises concerns the convergence properties of
agents using the different rewards. Unfortunately, analysis of convergence in a
multiagent problem is difficult given nonlinear interactions between the agent learning
algorithms and the reward function. In theory, the difference rewards are shown to
converge to (potentially local) minima as long as the system reward converges [33].
In practice, the performance of the agents does not change significantly after 200
learning steps in these experiments.

Figure 2 shows the scaling results for the number of agents ranging from 1 to 200.
These results show that the relative performance of the algorithms is not affected
by the number of agents. Note that the true difference reward has remarkably good
scaling characteristics as its performance does not degrade as the number of agents
is increased from five agents to 200 agents, making it a good choice for large domains
where the functional form of G is known.

Fig. 2. Marketing experiment scaling after 200 episodes. The estimates for D degrade more
gracefully than G.


5.1.3.  Computational  cost  of  D  and  Dest 

The  results  above  show  the  performance  of  the  different  algorithms  after  a  specific 
number  of  episodes,  demonstrating  that  D  performs  significantly  better  than  the 
other  algorithms.  In  domains  where  the  functional  form  of  G  is  known,  D  can  often 
be  computed  without  explicit  calls  to  G  [2].  However,  if  the  agents  are  unable  to 
streamline  their  computation  of  D,  agents  using  the  difference  reward  may  be  forced 
to make many computations of G. In general, for n agents, that means D requires n
times as many G function calls. Table 2 shows the relative performance for a given
number of G evaluations. The reward D performs best when used over the full 200
episodes, but requires 4000 computations of G. The two estimates to D provide
the best compromise between performance and computational cost, outperforming
both D and G for a given number of G evaluations. Note that in cases where G
is only computed by sampling the environment, it may not even be possible to
compute D at any computational cost and the estimates will have to be used as
discussed in Sec. 6.

Table 2. Marketing experiment with 100 agents, after 200 G evaluations
(except for $D^{4000}$ which has 4000 G evaluations at episode 200).

Reward        G        $\sigma/\sqrt{n}$    Steps
$D^{est2}$    94.2     0.5       200
$D^{est1}$    93.3     0.5       200
D             52.6     0.8       2
$D^{4000}$    110.4    0.0003    200
G             76.1     0.7       200
MC            62.6     0.6       200

5.2.  Air  traffic  flow  problem 

The second domain we study is the complex domain of air traffic flow management
[3, 7, 13, 23, 26, 29, 31]. This is a complex real world problem, where the
agent actions cannot be directly mapped to a system reward in analytical form,
creating a “black box” reward for the learning system.


5.2.1.  Problem  description 

In  this  section,  we  summarize  how  distributed  learning  agents  can  learn  to  manage 
air  traffic  flow  [31].  First,  we  will  assign  agents  to  airspace  locations  called  “fixes”  to 
map  the  air  traffic  problem  to  a  multiagent  problem.  Each  agent  is  responsible  for 
any aircraft going through its fix [3, 31]. The action of an agent is to determine the
separation d (distance between aircraft) that aircraft have to maintain when going
through the agent's fix (though aircraft will always keep a safe distance, $d_s$, if d is
set too low). The effect of issuing higher separation values is to slow down the rate
of aircraft that go through the fix. By increasing the value of d, an agent can limit
the amount of air traffic downstream of its fix, reducing congestion at the expense
of increasing the delays upstream.

Second,  we  will  use  FACET  (Future  ATM  Concepts  Evaluation  Tool,  where 
ATM  stands  for  Air  Traffic  Management)  to  simulate  air  traffic  and  determine 
the  impact  of  the  agents’  actions  [8].  FACET  simulates  air  traffic  based  on  flight 
plans  and  through  a  graphical  user  interface  allows  the  user  to  analyze  congestion 
patterns  of  different  sectors  and  centers  (Fig.  3).  FACET  also  allows  the  user  to 
change  the  flow  patterns  of  the  aircraft  through  a  number  of  mechanisms,  including 
“metering”  aircraft.  Metering  is  performed  by  choosing  a  “Miles  in  Trail”  (MIT) 
value,  which  specifies  the  minimum  distance  that  aircraft  may  be  spaced  from  each 
other  when  passing  through  a  particular  location.  Larger  MIT  values  cause  aircraft 
to  be  spaced  further  apart.  In  this  paper,  agents  send  scripts  to  FACET  asking  it 
to  simulate  air  traffic  based  on  metering  orders  imposed  by  the  agents.  The  agents 
then  produce  their  rewards  based  on  received  feedback  from  FACET  about  the 
impact  of  these  meterings. 

Finally,  we  will  define  a  system  reward  function  that  focuses  on  the  amount  of 
congestion  in  a  particular  sector  and  on  the  amount  of  measured  air  traffic  delay. 
This  is  measured  as  a  function  of  the  agents’  action  vector  z,  specifying  the  MIT 
values  chosen  by  the  agents.  More  precisely,  we  have: 

$G(z) = -((1 - \alpha) B(z) + \alpha C(z))$,    (7)

where B(z) is the total delay penalty for all aircraft in the system, C(z) is the
total congestion penalty, and $\alpha$ determines the relative importance of these two.
Neither B(z) nor C(z) can be analytically computed by an agent. Rather, they are
computed after the number of aircraft in a sector is computed.

Fig. 3. FACET screen-shot displaying traffic routes.

With $\alpha = 0.5$, for the two-congestion problem in our experiments we used an
instance of this reward function described in detail in Ref. 31 and summarized as
follows:

$G(z) = -\lambda_1 \sum_i \sum_t k_{t,i}\, u(t - \tau_i)(t - \tau_i) - \lambda_2 \sum_i \sum_t u(k_{t,i} - C_i)\, e^{\beta (k_{t,i} - C_i)}$,    (8)

where t is time, $k_{t,i}$ is the number of aircraft in congestion i, and u(t) is the unit step
function. $\tau_i$ is the delay penalty constant ($\tau_1 = 200$ and $\tau_2 = 175$ here) and $C_i$ is
the congestion penalty constant ($C_1 = 18$ and $C_2 = 15$ here). $\lambda_1$ and $\lambda_2$ are scaling
factors for the delay and congestion terms ($\lambda_1 = 1$ and $\lambda_2 = 50$), and $\beta = 0.3$.
The values of $k_{t,i}$ are computed by FACET and are affected by the actions of the
agents as described in the following section. The term $\beta$ is a user defined constant
controlling the penalty curve for congestion. Note that G cannot be expressed in
closed form in terms of the actions of the agents, since the effect of those actions
on the congestion ($k_{t,i}$) is not known in closed form.
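Once the aircraft counts are available, a reward of this kind is cheap to evaluate. The sketch below assumes one particular reading of Eq. (8): a delay term that charges the aircraft counted after time $\tau_i$ in proportion to how late they are, plus an exponential congestion term that applies only when the count exceeds $C_i$. That reading, the function names, the strict-inequality step function and the use of list indices as time steps are all assumptions made for illustration; in the experiments the counts come from FACET.

```python
import math


def unit_step(x):
    # u(x): 1 for positive arguments, 0 otherwise (convention at 0 is assumed).
    return 1.0 if x > 0 else 0.0


def system_reward(counts, tau, cap, lam1=1.0, lam2=50.0, beta=0.3):
    """counts[i][t] holds k_{t,i}, the aircraft count in congestion i at time t."""
    delay = 0.0
    congestion = 0.0
    for i, series in enumerate(counts):
        for t, k in enumerate(series):
            # Delay penalty: aircraft still present after tau_i, weighted by lateness.
            delay += k * unit_step(t - tau[i]) * (t - tau[i])
            # Congestion penalty: exponential in the excess over capacity C_i.
            congestion += unit_step(k - cap[i]) * math.exp(beta * (k - cap[i]))
    return -lam1 * delay - lam2 * congestion
```

The expensive part is producing `counts`, not this sum, which is why the cost comparisons in this section count simulator calls rather than reward evaluations.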


5.2.2.  Results 

We tested the performance of the different rewards on an air traffic domain with
300 aircraft. The aircraft go through two points of congestion over a four hour
simulation, with 200 going over one point of congestion and 100 going over the
other point of congestion. The second congestion is less severe than the first one,
so agents have to form different policies depending on which point of congestion they
are influencing. The points of congestion are created by setting up a series of flight
plans that cause the number of aircraft in the sectors of interest to be significantly
more than the number allowed by the FAA.

Fig. 4. Performance with 300 aircraft, 20 agents. The estimates for D perform better than G,
though not as well as the full D, which is computationally expensive in this domain.

The results displayed in Fig. 4 show that the relative performance of the five
methods is similar to the Marketing Problem. However, in this case $D^{est2}$ performs
better than $D^{est1}$. This is caused by the limited amount of data available in this
domain and the fact that $D^{est2}$ draws from a larger sample to estimate D, resulting in
a cleaner signal. Figure 5 shows scaling results for the number of agents varying
from 10 to 50 and shows that the conclusions are not sensitive to the number of
agents. Agents using $D^{est2}$ perform slightly better than agents using $D^{est1}$ in all
cases except for 40 and 50 agents, where they are statistically equivalent. While adding
more fixes increases the amount of control the agents have over the system, this
increase does not necessarily improve performance. The main issue is that when the
number of fixes grows in this problem, the number of aircraft going through each
fix decreases. This could result in certain fixes in superior positions controlling fewer
aircraft, causing a reduction in performance.

5.2.3.  Computational  cost  of  D  and  Dest 

As was the case for the Marketing domain, the results above show that D is superior
to the other algorithms. However, in the air traffic domain, D can only be computed
with additional calls to the FACET simulator, which come at significant computational
cost. The computation cost of the system reward, G [Eq. (7)], is almost
entirely dependent on the computation of the airplane counts for the congestions
$k_{t,i}$, which need to be computed using FACET (see footnote b). Except when D is used, the values
of k are computed once per episode. However, to compute the counterfactual term
in D, if FACET is treated as a “black box,” each agent has to compute its own
values of k for its counterfactual, resulting in n + 1 computations of k per episode.

488  K. Turner and A. Agogino

Fig. 5. Impact of number of agents on system performance with 300 aircraft. Performance
improves with higher number of agents, but only if the algorithms and agent rewards can “extract”
the extra information.

Table 3. System performance for 20 agents, 300 aircraft, after 2100 G
evaluations (except for $D^{44K}$ which has 44,100 G evaluations at step 2100).

Reward        G         $\sigma/\sqrt{n}$    Steps
$D^{est2}$    -232.5    7.55     2100
$D^{est1}$    -234.4    6.83     2100
D             -277.0    7.80     100
$D^{44K}$     -219.9    4.48     2100
G             -412.6    13.60    2100
MC            -639.0    16.40    2100

Table 3 shows the performance of the algorithms after 2100 G computations
for each of the algorithms for the simulations presented in Fig. 4 where there were
20 agents and two congestions. All the algorithms except the fully computed D
reach 2100 k computations at time step 2100. D, however, computes k once for
the system, and then once for each agent, leading to 21 computations per time
step. It therefore reaches 2100 computations at time step 100. We also show the
results of the full D computation at t = 2100, which needs 44,100 computations
of k, as $D^{44K}$. Although $D^{44K}$ provides the best result by a slight margin, it is
achieved at a considerable computational cost. Indeed, the performance of the two D
estimates is remarkable in this case as they were obtained with about 20 times fewer
computations of k. Furthermore, the two D estimates significantly outperform the
full D computation for a given number of computations of k and validate the
assumptions made in Sec. 4. This shows that for this domain, in practice it is more
fruitful to perform more learning steps and approximate D than to perform few learning
steps with full D computation when we treat FACET as a black box.

b In our simulations a computation from FACET took 900 milliseconds, while all the other
computation for all 20 agents in an episode took a combined 5 milliseconds.
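The budget arithmetic behind these numbers can be checked with a few lines. This is a sketch with hypothetical function names; the 0.9 second cost per simulator call is the figure quoted in footnote b, and the small per-episode agent overhead is ignored.

```python
def steps_within_budget(k_budget, n_agents, full_difference):
    # Full D needs one simulator call for the system plus one per agent;
    # G, MC and the D estimates need a single call per step.
    per_step = n_agents + 1 if full_difference else 1
    return k_budget // per_step


def simulator_seconds(n_steps, n_agents, full_difference, seconds_per_call=0.9):
    # Total simulator time for a run of n_steps learning steps.
    per_step = n_agents + 1 if full_difference else 1
    return n_steps * per_step * seconds_per_call
```

For 20 agents, a budget of 2100 calls covers 2100 steps with the estimates but only 100 steps with full D, and running full D for 2100 steps takes 21 x 2100 = 44,100 calls. At 0.9 seconds per call that is roughly 11 hours of simulation, against about half an hour for the estimates.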


6.  Discussion 

Learning  multiagent  policies  is  difficult  due  to  the  structural  credit  assignment 
problem  of  how  to  credit  an  action’s  contribution  to  a  system  reward,  which  is  a 
function  of  many  actions.  Furthermore,  the  mapping  from  agent  actions  to  system 
reward  cannot  always  be  computed  in  closed  form.  This  paper  proposes  to  address 
this  issue  using  an  estimate  to  a  “difference  reward”  where  agents  learn  using  an 
agent-centric  reward  that  promotes  coordination.  On  a  marketing  problem  and 
an  air  traffic  flow  problem,  experimental  results  show  that  our  method  provides  an 
improvement  in  performance  by  up  to  44%  over  team  games  and  difference  rewards 
(when  computational  cost  is  taken  into  account). 

Whether  the  difference  reward  or  its  estimate  should  be  used  depends  on  what 
is  known  about  the  functional  form  of  the  system  reward,  and  how  much  it  costs 
to  compute.  We  are  interested  in  three  main  types  of  system  reward: 

(1)  the  system  reward  has  a  functional  form  that  is  completely  known; 

(2)  the  system  reward  is  a  black  box  with  high  computational  costs;  or 

(3)  the  system  reward  is  sampled  from  the  environment,  where  we  cannot  demand 
samples  for  arbitrary  actions. 

The air traffic flow management problem is an instance of the second type of
problem, since the FACET simulator can be used as a black box to retrieve values of
f(z), but at an extremely high computational cost. In this case, agents should use
the estimate of the difference reward to save computational costs. The marketing
problem is an instance of the third type because agents do not know $f_i$ (i.e. they do
not know how their actions affect their target audience). In this case, the agents can
only count the aggregate number of people affected by all the agents, so they must
use the estimate to the difference reward, since the true difference reward cannot
be computed. (Note that the marketing problem would be the first type of problem
if the values of $f_i$ were known ahead of time and the difference reward could be
computed in closed form. In this case, the true difference reward can be used.)

This  work  provided  the  groundwork  for  multiagent  learning  in  domains  where 
the  system  reward  is  not  known  in  closed  form.  There  are  three  promising  extensions 
of  this  work:  First,  the  manner  in  which  the  estimate  for  f(z)  used  in  the  difference 
rewards  is  computed  can  be  improved.  Currently,  we  use  simple  averaging,  though 
using data aging or similarity measures to provide a weighted average can improve
the estimate. Second, the functional form of the system rewards can be extended
beyond that given in Eq. (1), using more general machine learning methods
to estimate the difference reward. Third, the difference reward estimates are now
restricted  by  the  form  of  G.  Blending  imperfect  models  of  the  environment  with  true 
samples  in  order  to  compute  the  difference  reward  would  increase  both  the  speed 
and  the  accuracy  of  the  estimates.  We  are  currently  investigating  all  three  avenues 
of  research  and  extending  the  application  domains  to  include  robotic  exploration 
and  more  realistic  forms  of  the  air  traffic  flow  problem  (including  the  role  of  human 
air  traffic  controllers). 
ABSTRACT

Evolving multiple robots so that each robot acting independently
can contribute to the maximization of a system level
objective presents significant scientific challenges. For example,
evolving multiple robots to maximize aggregate information
in exploration domains (e.g., planetary exploration,
search and rescue) requires coordination, which in turn requires
the careful design of the evaluation functions. Additionally,
where communication among robots is expensive
(e.g., limited power or computation), the coordination must
be achieved passively, without robots explicitly informing
others of their states/intended actions. Coevolving robots
in these situations is a potential solution to producing coordinated
behavior, where the robots are coupled through
their evaluation functions. In this work, we investigate coevolution
in three types of domains: (i) where precisely n
homogeneous robots need to perform a task; (ii) where n
is the optimal number of homogeneous robots for the task;
and (iii) where n is the optimal number of heterogeneous
robots for the task. Our results show that coevolving robots
with evaluation functions that are locally aligned with the
system evaluation significantly improve performance over
robots evolving using the system evaluation function directly,
particularly in dynamic environments.

Categories  and  Subject  Descriptors 

I.2.6 [AI]: Learning

General  Terms 

Algorithms,  Experimentation 

Keywords 

Robot  coordination;  Coevolution;  Team  Formation 

1.  INTRODUCTION 

Coordinating multiple robots to achieve a system-wide objective
in an unknown and dynamic environment is critical
to many of today's relevant applications, including the autonomous
exploration of planetary surfaces and search and
rescue in disaster response. In such cases, the environment
may be dangerous, uninhabitable to humans altogether, or
sufficiently distant from central control that response times
require autonomous, coordinated behavior. Evolutionary algorithms
are particularly relevant to these applications, as
solutions to robotic behavior in such complex environments
are difficult or impossible to model.

In general, most multi-robot tasks can be broadly categorized
into [8]: (i) tasks where a single robot can accomplish
the task, but where having a multi-robot system improves
the process (for example, terrain mapping or trash collection);
and (ii) tasks where multiple robots are necessary to
achieve a task (for example, to carry an object). In both
cases, coordination requires addressing many challenges (low
level navigation, high level decision making, inter-robot coordination),
each of which requires some degree of information
gathering [17]. However, in the first case, a failure of
coordination leads to inefficient use of resources, whereas
in the second, it leads to a complete system breakdown.
Therefore, a delicate balance must be established within a
robot's behavior such that coordination is achieved without
an overly strict adherence to a specific coordination protocol.
Through coevolution, robots are given the freedom to
develop their own protocols to benefit the system objective.

In this work, we focus on problems of the second type,
and investigate the robot evaluation functions that need to
be derived for the overall system to achieve high levels of performance.
To that end, we investigate the use of difference
evaluation functions to promote team formation [3]. Such
evaluation functions have previously been applied to multiagent
coordination problems of the first type [1, 18]. The key
contribution of this work is to extend those results to coordination
problems of the second type where unless tight coordination
among the agents is established and maintained,
the tasks cannot be accomplished. We develop teams within
the multi-robot system using passive means (e.g., no explicit
coordination directives) through the coupling of the robots'
evaluation functions.

The application domain we selected is a distributed information
gathering problem. First, we explore the case where
unless a particular point of interest is observed by n robots,
the point of interest is not considered as observed. Second, we
explore the case where there is an optimal number of robots
(n) that need to observe a point of interest, but where the
system receives some value for observations by teams with
other than n members. Finally, we construct a system where
the individuals are of differing capabilities, and one of each
type is needed to provide optimal behavior.

In  Section  2  we  discuss  the  robot  exploration  problem.  In 
Section  3,  we  present  the  problem  requiring  team  formation. 
In  Section  4  we  present  the  problem  of  encouraging  rather 
than  requiring  team  formation,  and  in  Section  5  we  present 
heterogeneous  teams  with  robots  of  two  types.  Finally  in 
Section  6  we  discuss  the  implication  of  these  results  and 
highlight  future  research  directions. 

1.1  Related  Work 

Extending single robot approaches to multi-robot systems
presents difficulties in ensuring that the robots learn a particular
task beneficial to the overall system. New approaches
that are particularly well suited to multi-robot systems include
using Markov Decision Processes for online mechanism
design [15], developing new reinforcement learning based
algorithms [4, 6, 9, 10], devising agent-specific evaluation
functions [3], and domain based evolution [5]. In addition,
forming coalitions for purposes of reducing search costs [11],
employing multilevel learning architectures for the formation
of coalitions [16], and market based approaches [21]
have been examined.

The use of evolutionary algorithms in a multiagent domain
is attractive due to the complex, non-Markovian nature of
most systems. Coevolution furthers the advantages by evaluating
the performance of individuals based on the interactions
with others within the system. However, coevolution algorithms
tend to favor stability over optimality [19], finding
stable equilibria in agent behavior. One method used to alleviate
this tendency is biasing the evaluation functions such
that the fitness is evaluated on the most beneficial collaborative
agents [13, 14]. The work in this paper is similar, where
the most beneficial collaborators are those robots that most
closely observe a Point of Interest, evaluated through a difference
function. In addition, cooperative coevolution was
further classified by defining a robustness criterion, demonstrated
on a set of standard multiagent problems [20]. An interesting
further extension to coevolution encodes individual
agents with a base skill-set [7], preventing coevolved agents
from having to learn the same thing independently.

2.  ROBOT  COORDINATION 

The multi-robot information gathering problem we investigate
in this work consists of a set of robots that must observe
a set of points of interest (POIs) within a given time
window [3]. The POIs have different importance to the system,
and each observation of a POI yields a value inversely
related to the distance the robot is from the POI. In addition,
and particular to the work presented in this paper,
multiple observations of a POI are either required (Section 3)
or highly beneficial (Section 4) to the system objective.

2.1  Robot  Capabilities 

Each robot uses an evolutionary algorithm to map its sensor
inputs to an x, y translation relative to the current position
of the robot. Each robot utilizes a two layer sigmoid
activated artificial neural network to perform this mapping.

The inputs to this neural network are four POI sensors
(Equation 1) and four robot sensors (Equation 2), where
$x_q^{POI}$ and $x_q^{ROBOT}$ provide the POI and robot “richness”
of each quadrant q, respectively, $V_j$ and $L_j$ are the value
and location of POI j respectively, $L_i$ is the location of the
current robot i and $\theta_{j,q}$ is the separation in radians between
the POI and the center of the sensor quadrant.


The  two  outputs  indicate  the  velocity  of  the  robot  (in 
the  two  axes  parallel  and  perpendicular  to  the  current  robot 
heading).  The  weights  of  the  neural  network  are  adjusted 
through  an  evolutionary  search  algorithm  [3,  2]  for  ranking 
and  subsequently  locating  successful  networks  within  a  pop¬ 
ulation  [12,  3[.  The  algorithm  maintains  a  population  of 
ten  networks,  utilizes  mutation  to  modify  individuals,  and 
ranks  them  based  on  a  performance  metric  specific  to  the 
domain.  The  search  algorithm  used  is  shown  in  Figure  1 
which  displays  the  ranking  and  mutation  steps. 
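
The controller described above can be illustrated with a small forward pass. This is only a sketch under stated assumptions: the hidden-layer size (10), the bias terms, the weight initialization range, and all function names are hypothetical, and the text does not say how the sigmoid outputs are rescaled into signed velocities, so that step is omitted.

```python
import math
import random

def sigmoid(a):
    # Clamp the activation so math.exp cannot overflow for extreme weights.
    a = max(-60.0, min(60.0, a))
    return 1.0 / (1.0 + math.exp(-a))

def layer(inputs, weights, biases):
    """One fully connected sigmoid layer; weights[k] holds the input weights of unit k."""
    return [sigmoid(sum(w * x for w, x in zip(unit, inputs)) + b)
            for unit, b in zip(weights, biases)]

def control(sensors, net):
    """Map the 8 sensor readings (4 POI, 4 robot) to the 2 network outputs."""
    hidden = layer(sensors, net["w_hidden"], net["b_hidden"])
    return layer(hidden, net["w_out"], net["b_out"])

def random_net(n_in=8, n_hidden=10, n_out=2, seed=0):
    """Build a randomly initialized network (sizes other than 8 and 2 are assumed)."""
    rng = random.Random(seed)

    def rand(n):
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

    return {"w_hidden": [rand(n_in) for _ in range(n_hidden)],
            "b_hidden": rand(n_hidden),
            "w_out": [rand(n_hidden) for _ in range(n_out)],
            "b_out": rand(n_out)}
```

Calling `control([0.5] * 8, random_net())` returns two values in (0, 1), one per output axis.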


Initialize N networks at T = 0
For T < Tmax Loop:

1. Pick a random network Ni from population
   With probability ε: Ncurrent ← Ni
   With probability 1 − ε: Ncurrent ← Nbest

2. Mutate Ncurrent to produce N'

3. Control robot with N' for next episode

4. Rank N' based on performance (evaluation function)

5. Replace Nworst with N'

Figure 1: Evolutionary Algorithm: An ε-greedy evolutionary
algorithm to determine the weights of the neural networks. See
text body for definitions. T indexes episodes, N indexes networks
with appropriate subscripts, and N' is the modified network for
use in control of the current episode.

In this domain, mutation (Step 2) involves adding a randomly
generated number to every weight within the network. This can be
done in many ways; here, each number is sampled from a Cauchy
distribution, with the samples limited to the continuous range
[-10.0, 10.0] [3]. Ranking of the network performance (Step 4)
is done using a domain-specific evaluation function, which is
discussed in the following section.
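
The search loop of Figure 1, together with the bounded Cauchy mutation, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the flat weight vector, the default ε = 0.1, the unit Cauchy scale, the initial weight range, and the choice to clip (rather than resample) out-of-range samples are all assumptions.

```python
import math
import random

def cauchy_sample(rng, scale=1.0, limit=10.0):
    """Draw one Cauchy-distributed value, clipped to [-limit, limit] (clipping is assumed)."""
    value = scale * math.tan(math.pi * (rng.random() - 0.5))
    return max(-limit, min(limit, value))

def mutate(weights, rng):
    """Step 2: add a bounded Cauchy sample to every weight."""
    return [w + cauchy_sample(rng) for w in weights]

def evolve(evaluate, n_weights, episodes, pop_size=10, epsilon=0.1, seed=0):
    """Epsilon-greedy search over weight vectors; returns (best_weights, best_score)."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(n_weights)]
                  for _ in range(pop_size)]
    scores = [evaluate(w) for w in population]
    for _ in range(episodes):
        # Step 1: a random network with probability epsilon, otherwise the best one.
        if rng.random() < epsilon:
            parent = rng.randrange(pop_size)
        else:
            parent = max(range(pop_size), key=scores.__getitem__)
        child = mutate(population[parent], rng)       # Step 2
        child_score = evaluate(child)                 # Steps 3 and 4
        worst = min(range(pop_size), key=scores.__getitem__)
        population[worst], scores[worst] = child, child_score   # Step 5
    best = max(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]
```

Here `evaluate` stands in for one episode of robot control scored by the chosen evaluation function. For example, `evolve(lambda w: -sum(x * x for x in w), n_weights=4, episodes=200)` searches for weights near zero.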

2.2  Robot  Objectives 

In  these  experiments,  we  used  three  different  evaluation 
functions  [3]  to  determine  the  performance  of  the  robot:  the 
system  evaluation  function  which  rates  the  performance  of 
the  full  system;  a  local  evaluation  function  that  rates  the 
performance  of  a  “selfish”  robot;  and  a  difference  evaluation 
function  that  aims  to  capture  the  impact  of  a  robot  in  the 
multi-robot system [3]. These three evaluation functions are:

•  The system evaluation reflects the performance of the
full system. Although having every robot optimize this
evaluation function guarantees that the robots all work toward
the same purpose, robots have a difficult time discerning
their impact on this function, particularly as the
number of robots in the system increases.

•  The  local  evaluation  reflects  the  performance  of  the 
robot  operating  alone  in  the  environment.  Each  robot 
is  rewarded  for  the  sum  of  the  POIs  it  alone  observed. 
If  the  robots  operate  independently,  optimizing  this 
evaluation  function  would  lead  to  good  system  behav¬ 
ior.  However,  if  the  robots  interact  frequently,  then 
each  robot  aiming  to  optimize  its  own  local  function 
may  lead  to  competitive  rather  than  cooperative  be¬ 
havior. 

•  The difference evaluation reflects the impact a robot
has on the full system [3, 2]. By subtracting the value the
system evaluation would take if robot i were inactive, the
difference evaluation computes the value added by the
observations of robot i alone. Because this difference needs
to be computed only for the POIs to which robot i was closest,
this evaluation function is “locally” computable
in most instances.

Though  conceptually  the  same,  the  specifics  of  these  eval¬ 
uations  are  different  for  each  of  the  problems  described  in 
the  following  sections.  We  derive  those  specific  evaluation 
structures  and  present  the  experimental  results  below. 
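
In symbols, the difference evaluation described in the list above is commonly written as the system evaluation minus a counterfactual in which robot i is removed. The notation $z_{-i}$ (the system state without robot i) is introduced here for illustration and does not appear in the text:

```latex
D_i(z) = G(z) - G(z_{-i})
```

Because the second term does not depend on the actions of robot i, any action by robot i that increases $D_i$ also increases $G$.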


3.  REQUIRING  TEAM  FORMATION 

In the first problem we examine, the robots need to form
teams to perform a task and contribute to the system objective.
In this problem, a POI is considered observed only if n
robots visit that POI from within a certain observation distance.
Neither the robot nor the system receives any value
unless multiple observations of a POI occur. This formulation
ensures that the problem cannot be solved by a single robot
and that team formation is essential to the completion of
each task.


3.1  Problem  Definition 

To  formalize  this  problem,  let  us  first  focus  on  a  problem 
where  the  observations  of  the  two  robots  closest  to  a  POI 
are  tallied.  If  more  than  two  robots  visit  a  POI,  only  the 
observations  of  the  closest  two  are  considered  and  their  visit 
distances  are  averaged  in  the  computation  of  the  system 
evaluation  (G),  which  is  given  by: 


$$G(z) = \sum_i \sum_j \sum_k \frac{V_i \, N^1_{i,j} \, N^2_{i,k}}{\frac{1}{2}\left(\delta_{i,j} + \delta_{i,k}\right)} \tag{3}$$

where $V_i$ is the value of the ith POI, $\delta_{i,j}$ is the closest
distance between the jth robot and the ith POI, and $N^1_{i,j}$ and
$N^2_{i,k}$ determine whether a robot was within the observation
distance $\delta_o$ and was the closest or second closest robot,
respectively, to the ith POI (Equations 4 and 5).
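
The system evaluation above can be sketched in a few lines of code. The sketch assumes the data layout shown in the docstring and strictly positive distances; the function names are hypothetical. The second function applies the counterfactual idea from Section 2.2 (remove robot j and recompute G) rather than reproducing any closed-form expression.

```python
def system_evaluation(values, distances, delta_o):
    """G(z): each POI contributes V_i divided by the mean distance of its two
    closest robots, provided both are within the observation distance delta_o.

    values[i] is V_i; distances[i][j] is the closest approach of robot j to
    POI i (assumed strictly positive).
    """
    total = 0.0
    for v, row in zip(values, distances):
        in_range = sorted(d for d in row if d < delta_o)
        if len(in_range) >= 2:
            total += v / (0.5 * (in_range[0] + in_range[1]))
    return total

def difference_evaluation(values, distances, delta_o, j):
    """D_j(z) = G(z) - G(z without robot j)."""
    without_j = [[d for k, d in enumerate(row) if k != j] for row in distances]
    return (system_evaluation(values, distances, delta_o)
            - system_evaluation(values, without_j, delta_o))
```

For a single POI of value 10 with three robots at distances 1, 2 and 3 (all within range), G uses only the two closest robots. The difference evaluation of the second robot compares the pair (first, second) against the pair (first, third), and the third robot's difference evaluation is zero because removing it leaves G unchanged.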
The single-robot evaluation function used by each robot
focuses only on the value a robot receives for observing a
particular POI, and results in:

$$P_j(z) = \sum_i \frac{V_i}{\delta_{i,j}}, \quad \text{if } \delta_{i,j} < \delta_o \tag{6}$$

This evaluation promotes selfish behavior only, providing
a clear, easy-to-learn signal, but one not aligned with the
system objective as a whole.

Finally, the difference evaluation for a robot aims to provide
system-wide beneficial behavior, while remaining sensitive
to the actions of a robot [3]. This difference evaluation
function is given by Equation 7, where l is the third closest
robot to POI i (meaning that robots j and k are the closest two
for the first two conditionals). All three of these evaluations
were applied for learning in many different situations, though
for brevity, only an environment with 50 POIs and 40 robots
(which was representative of the general performance of the
evaluations) is presented.


Figure 2: Sample robot paths in an exploration scenario.
Multiple observations are made of a particular point of interest.
In the team formation domain, multiple observations
must be made for the POI to have any value to the system.
Background courtesy of JPL.


Figure 2 shows a schematic of how these evaluation functions
are computed, given that all three robots are within
the observation radius. Only robots 1 and 2 (R1 and R2)
are taken into consideration when calculating G(z) because
their observation distances ($\delta_{1,1}$ and $\delta_{1,2}$) are
smaller than that of R3 ($\delta_{1,3}$). For G(z), robot 3’s
observation is discarded. For the difference evaluation for
robots 1 or 2, robot 3 is taken into consideration. For example,
in calculating Equation 7 for R2, the first term considers R1
and R2, whereas the second term considers R1 and R3. That is,
R2 receives the difference between the observation values of
R1 and R2 and the observation values of R1 and R3.

3.2  Results


The environment used for presentation in this paper contained
40 robots and 50 POIs, providing a great deal of
information to be gathered, while simultaneously creating
a congested situation. In addition, the environment was
highly dynamic: 10% of the POIs (selected randomly)
changed location and value at each episode. This was done
to encourage coordination behavior based on sensor inputs
rather than specific x-y coordinates. The results
are based on 2000 episodes of 30 time-steps each, and are
averaged for significance.

Figure 3: Team Formation Required. Left: System evaluation is plotted versus episode for learning in an environment containing
40 robots and 50 POIs. Right: Maximum evaluation achieved is plotted for equal numbers of robots and POIs. Learning is
done with system, local, and difference evaluations requiring the formation of teams of two robots.

Figure 3 (left) shows that robots using all three evaluations
perform significantly better than random behavior. It
also shows that the difference evaluation provides a signal
that allows the robots to learn to coordinate their actions,
whereas the system and local evaluations do not.
Additionally, Figure 3 (right) shows that the difference
evaluation does not provide benefits until the system reaches
a high level of complexity.

4.  ENCOURAGING  TEAM  FORMATION 

In the second problem we examine, multiple robots are
encouraged (rather than required) to form teams to perform
a task and contribute to the system objective. In this problem,
a POI’s value is maximized when n robots observe it, but
the system receives a lesser value when other numbers of robots
observe the POI. Figure 4 shows the functional form of
the two system evaluations used in Section 3 and Section 4.
Figure 4: POI value structure is compared between the
required (left) and encouraged (right) team formation systems.


4.1  Problem  Definition 

For these evaluations, $\delta_o$ remains the same; however, the
distance of observation is no longer explicitly included in
the evaluation function, relying instead on inherent inclusion
in the observation radius of the POI. As before, three evaluation
functions are defined, beginning with the system evaluation
given by:

$$G(z) = \sum_i \alpha V_i \, x \, e^{-x/\beta} \tag{8}$$

where i indexes POIs, x is the number of robots within $\delta_o$,
$\beta$ is the observation capacity, and $\alpha$ is a constant
(chosen to be 1.37 such that the maximum of the exponential curve
approximates the POI value $V_i$).
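
The value of the scaling constant can be checked with a short calculation. The sketch below assumes an observation capacity of $\beta = 2$ for the two-robot case, which the text does not state explicitly:

```latex
\frac{d}{dx}\left( x \, e^{-x/\beta} \right)
  = \left( 1 - \frac{x}{\beta} \right) e^{-x/\beta} = 0
  \;\Rightarrow\; x = \beta ,
\qquad
\max_x \; x \, e^{-x/\beta} = \frac{\beta}{e} ,
\qquad
\alpha = \frac{e}{\beta} \approx 1.36 \quad \text{for } \beta = 2 .
```

Under that assumption, the constant 1.37 used in the text gives a peak of about $1.008\,V_i$ at x = 2, which is consistent with the statement that the maximum approximates the POI value.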

For this new system evaluation, the selfish robot evaluation
is defined as:

$$P_j(z) = \sum_i \alpha V_{i,j} \, x \, e^{-x/\beta} \tag{9}$$

where indexing and constant selection are the same as above.
This evaluation includes no information regarding contribution
to the system as a whole, indicating only what
robot j can directly observe. This robot evaluation is the
component of the system objective for which robot j was
within the observation distance $\delta_o$ of each POI.

Finally, the difference evaluation function for this system
results in:

$$D_j(z) = \sum_i \alpha V_{i,j} \left[ x \, e^{-x/\beta} - (x - 1) \, e^{-(x-1)/\beta} \right] \tag{10}$$

where indexing and constant selection are the same as above.
This evaluation aims to provide the contribution of robot
j to the system. The performance of all three evaluation
functions is presented in the next section.
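
The relationship between Equations 8 and 10 can be illustrated numerically. This sketch assumes β = 2 (the desired team size), represents each POI by the count of robots within the observation distance, and uses hypothetical function and argument names:

```python
import math

ALPHA = 1.37   # scaling constant from the text
BETA = 2.0     # observation capacity; assumed equal to the desired team size

def poi_term(value, x, alpha=ALPHA, beta=BETA):
    """alpha * V_i * x * exp(-x / beta) for x robots within delta_o of the POI."""
    return alpha * value * x * math.exp(-x / beta)

def system_evaluation(values, counts):
    """G(z), Equation 8: sum of the per-POI terms."""
    return sum(poi_term(v, x) for v, x in zip(values, counts))

def difference_evaluation(values, counts, observed_by_j):
    """D_j(z), Equation 10: change in each POI term if robot j were removed.

    observed_by_j[i] is True when robot j is within delta_o of POI i.
    """
    return sum(poi_term(v, x) - poi_term(v, x - 1)
               for v, x, seen in zip(values, counts, observed_by_j) if seen)
```

With two POIs of value 10 observed by 2 and 3 robots respectively, a robot that is the third observer of the second POI receives a negative difference evaluation: its presence pushes that POI past the peak at x = 2, so the system would score higher without it.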


4.2  Results 

All  training  parameters  were  maintained  from  those  used 
in  Section  3.2,  including  the  number  of  POIs  and  robots. 


Figure 5: Team Formation Encouraged. Left: System evaluation is plotted versus episode for learning in an environment
containing 40 robots and 50 POIs. Right: Maximum evaluation achieved is plotted for equal numbers of robots and POIs.
Learning is done with system, local, and difference evaluation functions requiring the formation of teams of two robots.


The results presented in Figure 5 are qualitatively similar
to those seen in Figure 3. This is a good result, demonstrating
that the team requirement is, in general, applicable and
successful for multiple formulations of the problem (that is,
it does not depend on the exact form of G). As before, the
difference evaluation provides consistent behavior throughout,
whereas the system evaluation function (aligned with the system,
but not sensitive to a given robot’s actions) and the local
evaluation (sensitive to a robot’s actions, but not necessarily
aligned with the system evaluation) break down.

Here again, Figure 5 (right) shows that the difference
evaluation, by providing a better learning signal, yields
consistent behavior as the complexity of the system increases.
The performance of the system and local evaluation functions
tapers off, whereas the difference evaluation maintains its
performance slope, indicating that when the number
of robots within the system becomes large, the difference
evaluation is able to maintain successful dynamic team
formation. In addition, by encouraging team formation
rather than requiring it, we have presented a simpler problem
to learn.

4.3  Higher  Coordination  Requirements 

The  previous  two  sections  investigated  coordination  for 
n  =  2,  for  both  required  and  encouraged  team  formation 
scenarios.  The  behavior  of  the  three  evaluation  functions 
was  similar  for  both  cases.  In  this  section  we  investigate 
the  behavior  for  n  =  3,  a  change  that  has  significant  impact 
on  the  computation  of  G,  particularly  when  the  observation 
distance  is  not  increased. 

Figure  6  (left)  shows  the  learning  results  for  requiring 
three  robots  to  observe  a  POI.  The  all-or-nothing  learning 
structure  in  this  evaluation  function  makes  it  very  difficult 
for  a  robot  using  passive  team  formation  to  extract  the  rel¬ 
evant  signal.  This  brings  the  difference  evaluation  closer  to 
the  system  objective  by  reducing  its  sensitivity  to  a  particu¬ 
lar  robot’s  actions  (that  is,  in  most  cases,  removing  a  robot 
from  the  system  has  no  impact  on  the  system  performance). 
As  a  consequence,  the  difference  evaluation  fails  to  promote 
good  system-level  behavior. 


By  contrast,  Figure  6  (right)  shows  the  behavior  of  the 
system  where  team  formation  is  encouraged  by  a  decaying 
value  assignment  to  POI  observations.  In  this  case,  moving 
from  n  =  2  to  n  =  3  does  not  affect  the  difference  evalua¬ 
tion.  This  is  because  in  this  problem,  removing  a  robot  has 
a  computable  impact  on  the  system  objective.  This  creates 
a  “gradient”  for  evaluating  the  impact  of  a  robot  on  the  sys¬ 
tem  as  a  whole.  As  a  consequence,  the  difference  evaluation 
performs  better  than  system  or  local  evaluation  functions. 

The above sections support two conclusions: a) encouraging
dynamic teams, rather than requiring them, is more robust to
changes in system definition, and b) difference evaluations are
more successful in systems that change in the number of robots
and POIs. We combine these conclusions to formulate a problem for
heterogeneous team formation in the following section.

5.  HETEROGENEOUS  TEAM  FORMATION

The success in team formation shown in the above sections
motivates an investigation of teams constructed of heterogeneous
robots. When the entire team is made of robots of
identical construction, the tasks are limited to general
redundant observations of an environment to provide robustness,
or mechanical tasks that require multiple individuals
to provide enough effort. In contrast, if the individuals can
learn to dynamically partner with one another, the question
arises whether, given additional sensing, individuals
of differing construction can partner to perform a more
specific suite of tasks.

5.1  Problem  Definition 

In the final problem we investigate, we define two robot
types: blue and green. These can represent any number of
possible construction differences, including sensing and
articulation, depending on the system in which they are
installed. The individuals must be able to distinguish
between the two types; for example, a blue robot
must be able to determine that there are green robots elsewhere
in the environment. In addition, the evaluation function
must again be modified to represent the need for robots
of differing capabilities to visit a POI.


Figure 6: Higher Coordination Requirements (n = 3). Left: Required Team Formation. Right: Encouraged Team Formation.
System evaluation is plotted versus episode for learning in an environment containing 40 robots and 50 POIs. Learning is done
with system, local, and difference evaluation functions for three robots to observe a POI.


The  sensing  capabilities  are  similar  to  those  shown  in  Sec¬ 
tion  2.1.  For  each  quadrant  q  however,  the  robot  sensor  is 
split  into  two,  one  indicating  the  density  of  “blue”  robots 
and  the  other  indicating  “green”  robots.  This  increases  the 
number  of  inputs  to  the  neural  network  from  8  to  12,  and 
the  number  of  hidden  units  was  increased  accordingly.  This 
configuration  maintains  comparability  to  homogeneous  ap¬ 
plications  while  providing  the  differentiation  between  robot 
types  needed  by  the  new  problem. 

We showed that encouraging team formation is more beneficial
to the learning process than requiring team formation,
and therefore the modified evaluation function retains the
exponential form as much as possible. Again, $\delta_o$ remains
the same, and the functional form includes the number of
robots in the observation radius of a given POI. The number
of observations, however, is separated into the number of blue
robots and green robots that made observations. Therefore,
the optimal solution is not only that two robots visit, but
that one of each type visits each POI.

As with previous work, three evaluation functions were
defined for comparison, reflecting the styles discussed in
Section 2.2. Beginning with the system-level evaluation:

$$G(z) = \sum_i \alpha V_i \, x_b x_g \, e^{-\frac{x_b x_g}{\beta_b \beta_g}} \tag{11}$$

where $x_{type}$ is the number of observations of a POI i by each
type of robot, $\alpha$ is a scaling constant to ensure the maximum
of the function approximates the POI value $V_i$ (set to 2.72
for these experiments), and $\beta_{type}$ are the constants to produce
functional peaks at the desired number of observations of
each type of robot. For example, to have one of each type
observe a POI, $\beta_b = \beta_g = 1$, which is the configuration for
subsequent experiments.
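
The choice of 2.72 for the scaling constant can be checked numerically. The sketch below assumes the exponent in the system-level evaluation is $-x_b x_g / (\beta_b \beta_g)$, the reading under which a constant close to e normalizes the peak; the function name and the scanned range are hypothetical.

```python
import math

def hetero_poi_term(value, x_b, x_g, alpha=2.72, beta_b=1.0, beta_g=1.0):
    """alpha * V_i * x_b * x_g * exp(-(x_b * x_g) / (beta_b * beta_g))."""
    product = x_b * x_g
    return alpha * value * product * math.exp(-product / (beta_b * beta_g))

# Scan small team compositions for a POI of value 1 and find the best one.
grid = {(b, g): hetero_poi_term(1.0, b, g) for b in range(5) for g in range(5)}
best = max(grid, key=grid.get)
```

Under these assumptions `best` is `(1, 1)`, one robot of each type, and the corresponding term is approximately equal to the POI value. Compositions with no robot of one type contribute nothing, and larger teams score progressively less.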

The local evaluation is similar to the above; however, it
reflects only the POIs that robot j has visited. Therefore it
is locally computable and easy to learn, but does not indicate
the robot’s impact on the system as a whole:

$$P_j(z) = \sum_i \alpha V_{i,j} \, x_b x_g \, e^{-\frac{x_b x_g}{\beta_b \beta_g}} \tag{12}$$

where indexing and constant selection are the same as
above.

Finally,  the  difference  evaluation  includes  information  con¬ 
tained  in  the  system-level  evaluation,  but  is  easier  to  learn  as 
it  directly  indicates  how  robot  j  contributed  to  the  system 
as  a  whole.  It  is  contingent  on  the  type  of  robot  j: 

(13) 

where indexing and constant selection are the same as above.
The equation shown is for robot j of type blue; if the
type is green, 1 is subtracted from the green robot
observations rather than the blue. The experimental results for
the use of all three evaluation functions follow in the next
section.
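
Following the description above and the pattern of Equation 10, the difference evaluation for a blue robot j can plausibly be written as shown below. This is a sketch derived by removing one blue observation from the system-level evaluation; the exponent form $-x_b x_g / (\beta_b \beta_g)$ is an assumption, and the expression is not taken from the authors' typeset equation.

```latex
D_j(z) = \sum_i \alpha V_{i,j}
  \left[ x_b x_g \, e^{-\frac{x_b x_g}{\beta_b \beta_g}}
       - (x_b - 1) \, x_g \, e^{-\frac{(x_b - 1) x_g}{\beta_b \beta_g}} \right]
```

For a green robot, the same expression would apply with $(x_g - 1)$ replacing $x_g$ in the second term and $x_b$ left unchanged.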

5.2  Results 

The  domain  for  the  experiments  involving  heterogeneous 
teams  is  the  same  as  that  used  in  the  above  work.  Each  robot 
is  randomly  assigned  a  type  at  the  beginning  of  each  experi¬ 
ment  based  on  a  given  team  ratio.  Learning  time  is  adjusted 
from  2000  episodes  to  3000  as  the  network  has  increased 
in  size,  and  the  problem  has  increased  in  difficulty,  slightly 
decreasing  convergence  speed.  The  environment  maintains 
its  dynamic  nature,  where  10%  of  the  POIs  change  location 
and  value  at  every  episode,  though  the  robots  maintain  their 
type  throughout  the  learning  process. 

Figure  7  (left)  shows  the  results  of  training  in  an  environ¬ 
ment  where  40  robots  and  50  POIs  are  present.  The  ratio 
of  blue  to  green  robots  is  50%,  meaning  there  are  20  of 
each  type  present.  With  the  increased  problem  complexity 
we  observe  that  the  local  evaluation  is  entirely  incapable  of 
learning  a  good  solution,  in  fact  learning  the  wrong  thing, 
performing  worse  than  random  parameter  selection  (network 
weights)  after  convergence. 


Figure 7: Heterogeneous Team Formation. Left: System performance for an environment containing 40 robots and 50 POIs.
Learning is done with system, local, and difference evaluation functions requiring the formation of teams of two robots, one of
each type. Right: Maximum performance achieved for equal numbers of robots and POIs. Learning is done with system, local,
and difference evaluation functions encouraging the heterogeneous formation of teams of two robots.


As with the results in Section 4.2, learning with the
system-level evaluation function proves difficult, as the signal
contains a great deal of information: too much regarding
other robots for each individual to ascertain which actions
best contribute to the system as a whole. The difference
evaluation, however, as expected, learns quickly and
maintains performance through the learning process. This
confirms the applicability of the difference evaluation in
general, and specifically indicates that dynamically requiring
heterogeneous team formation in a congested and dynamically
changing environment is achievable and, indeed, successful.

We  next  examine  the  impact  of  increasing  both  the  num¬ 
ber  of  robots  and  the  number  of  POIs  within  the  system  si¬ 
multaneously.  Figure  7  (right)  shows  the  maximum  system- 
level  evaluation  function  achieved  for  varying  numbers  of 
robots  and  POIs  (where  the  number  of  robots  and  POIs  is 
the  same).  The  local  evaluation  begins  poorly  and  decreases 
further  as  the  system  complexity  increases,  as  shown  in  pre¬ 
vious  figures.  Using  the  system-level  evaluation  for  learning, 
while  increasing  slightly  as  complexity  increases,  is  strongly 
outperformed by the difference evaluation. As with all previous
dynamic team formation work in this paper, utilization
of the difference evaluation function significantly improves
performance over the others, and provides an excellent learning
signal for dynamic team formation, particularly in domains
that lack communication and are heterogeneous in construction.

By varying the ratio between robot types present in the
system, we can determine whether the robots are able to
consistently modify their behavior to suit changes in the system.
For example, if a large set of robots of a specific type fails, the
system must be able to adjust its coordination behavior
to maintain success in accomplishing the tasks requested.
Figure 8 shows the maximum system performance achieved
when the ratio between blue and green robots is varied. The
variation is symmetrical; therefore, 10% blue and 90% green
is the same as 10% green and 90% blue. The number of
robots and POIs present in the system is held constant.

The local evaluation always performs poorly, and the ratio
of types within the system has little impact on the performance
of the system evaluation. This points to a lack of
attention paid to the heterogeneous nature of the team in
the behavior of the robots that learn with the system
evaluation. The difference evaluation, however, varies significantly
when the teams are strongly unbalanced, particularly when
the ratio is set to 20%. This is due to the variance in sensing
information during the learning process. For example, when
there are far fewer robots of one type within the system,
the sensors detecting the two types return significantly
different levels of information, and therefore the algorithm can
learn to focus on the sensors showing where robots of a different
type are located. This provides additional information
to the algorithm regarding the actions that will lead directly
to an increase in the learning evaluation performance.


Figure 8: Heterogeneous Team Ratios (horizontal axis: Team Ratio (%)).
System evaluation is plotted versus episode for learning in an environment
containing 40 robots and 50 POIs. Learning is done with system,
local, and difference evaluation functions requiring the
formation of teams of two robots, one of each type. The ratio
between blue and green robots varies in the system.


6.  DISCUSSION  AND  FUTURE  WORK 

Exploring planetary surfaces or responding to disasters
requires robotic solutions that operate in unknown and dynamic
environments. Coordinating multiple robots in such
domains presents additional challenges. In this work, we
explore multi-robot coordination domains where multiple
robots are necessary to achieve a task (for example, to carry
an object). We focus on passive coordination that is
accomplished through the robots’ evaluation functions.

The work presented in this paper explores three types of
problems where robot coordination is beneficial. First, we
explore a problem where n robots must coordinate to receive
a reward. Then, we explore a problem where the system
reward is optimized for n robots, but other numbers of robots
observing a POI also contribute to the system objective.
Finally, we develop a heterogeneous system where two types
of robots are present, and an observation by one robot of each
type produces optimal behavior.

In all three cases, coordination and team formation are
established and maintained through passive means encoded in
the robots’ evaluation functions. The difference evaluation
yielded the best results because it provided an evaluation
that was aligned with the overall system evaluation, while
maintaining sensitivity to a robot’s actions, even when many
robots were active within the coordinated system. That
approach also extended to three or more robots encouraged to
complete a task. This is an interesting result showing that
the difference evaluation is best suited to domains where the
impact of a robot on a system can be ascertained.

We are currently implementing the work discussed in this
paper in robot hardware. This involves investigating
non-episodic learning such that coordination and ad-hoc team
formation can be learned while the robot is in operation.
In addition, extensions to the learning algorithm used
in this paper will be investigated to accommodate the restrictions
of physical hardware.